How do you pick these models?

These are the models I actually reach for when building AI agents for clients or my own projects. The ranking reflects production experience: tool-use reliability, latency on agent loops, cost per run, and how often each model produces a usable agent without babysitting. I update this quarterly.

Why no [popular model] in your top picks?

If a model you use heavily isn't listed, either it isn't a good fit for AI agent development specifically (great at chat, weak at tool use), or I haven't shipped with it long enough to have a confident opinion. Arena-style leaderboards measure chat quality. I measure agent-loop reliability.

How often do you update these picks?

Quarterly. Each section shows a last-updated date. The picks only change when a model demonstrably beats the current top pick on agent-loop tasks over 2+ weeks of real shipping — not based on leaderboard rankings or marketing launches.

Do you ever switch mid-project?

Rarely. Switching LLMs mid-project invalidates prompts, evaluation snapshots, and cost baselines. Once a project is in production, I ride the same model until the project ends, even if a new release lands.

How much does cost matter?

Cost matters, but it isn't the top criterion. A model that's 50% cheaper but fails 30% more often costs more in retry loops, dev time, and user trust. I include cheaper models as #2/#3 picks where the failure-cost tradeoff is acceptable.
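To make that tradeoff concrete, here is a rough sketch of the arithmetic. It assumes failed attempts are retried until one succeeds, that attempts fail independently, and that "30% more often" is relative (10% becomes 13%). The prices, failure rates, and the per-failure overhead for dev time and lost trust are all hypothetical numbers, not measurements.

```python
def cost_per_success(price_per_attempt: float, failure_rate: float,
                     overhead_per_failure: float = 0.0) -> float:
    """Expected cost of one successful run when failed attempts are retried.

    Assumes attempts fail independently with probability `failure_rate`.
    `overhead_per_failure` is a hypothetical figure standing in for dev time
    and lost user trust on each failed attempt.
    """
    if not 0.0 <= failure_rate < 1.0:
        raise ValueError("failure_rate must be in [0, 1)")
    expected_attempts = 1.0 / (1.0 - failure_rate)
    expected_failures = expected_attempts - 1.0
    return (price_per_attempt * expected_attempts
            + overhead_per_failure * expected_failures)


# Hypothetical models: the budget one is 50% cheaper and fails 30% more often.
premium_api_only = cost_per_success(80.0, 0.10)
budget_api_only = cost_per_success(40.0, 0.13)

# Same comparison with a hypothetical overhead of 2000 per failed attempt.
premium_total = cost_per_success(80.0, 0.10, overhead_per_failure=2000.0)
budget_total = cost_per_success(40.0, 0.13, overhead_per_failure=2000.0)

print(f"API only: premium={premium_api_only:.2f} budget={budget_api_only:.2f}")
print(f"With overhead: premium={premium_total:.2f} budget={budget_total:.2f}")
```

With these made-up numbers the budget model still wins on API spend alone. It only becomes the more expensive option once each failure carries a real cost beyond the retry itself, which is why the dev-time and user-trust terms carry the argument.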

PICKS

My AI model picks

First-person picks from someone who ships AI agents for a living. These are the models I actually use, ranked by what I'd reach for first when starting a new project. Not a leaderboard mirror. Last updated June 2026.

Reading guide. #1 is the model I'd reach for first. #2 is the fallback when #1 is rate-limited, blocked, or stylistically wrong for the task. #3 is a specialist pick for a specific scenario (cost-sensitive, latency-sensitive, or a different "shape" of capability). The picks are for AI agent development specifically — tool use, long context, code generation, structured output, retry resilience. If you want a general-purpose chat ranking, the LMArena leaderboard and BridgeBench are better signals.
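The #1-then-#2 ordering can be sketched as a simple fallback chain. This is a minimal illustration, not how any vendor SDK works: `ModelUnavailable`, `run_with_fallback`, and the two stub callables are hypothetical names, and a real client wrapper would have to translate each vendor's rate-limit or block errors into that one exception.

```python
from typing import Callable, List, Sequence, Tuple


class ModelUnavailable(Exception):
    """Hypothetical error raised by a client wrapper when a model is rate-limited or blocked."""


def run_with_fallback(
    prompt: str,
    models: Sequence[Tuple[str, Callable[[str], str]]],
) -> Tuple[str, str]:
    """Try each (name, call) pair in rank order; return (name, reply) from the first that answers."""
    errors: List[str] = []
    for name, call in models:
        try:
            return name, call(prompt)
        except ModelUnavailable as exc:
            # Record why this pick was skipped, then move on to the next one.
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all models unavailable: " + "; ".join(errors))


# Stub callables standing in for real API clients.
def primary(prompt: str) -> str:
    raise ModelUnavailable("rate limited")


def fallback(prompt: str) -> str:
    return f"handled: {prompt}"


used, reply = run_with_fallback(
    "summarize the ticket",
    [("pick-1", primary), ("pick-2", fallback)],
)
print(used, reply)
```

Only availability errors trigger the fallback here. Anything else propagates, so a genuine bug in the agent is not silently retried on a second vendor.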

Text models (LLMs)

What I use for the agent loop — picking tools, calling functions, recovering from errors, structured output.

Claude Opus 4.7 (Anthropic)

My default for 80% of agent builds.

The most reliable tool-use model I've shipped with. Handles long agent loops (50+ steps) without drifting from the system prompt. Best at following complex multi-tool workflows. Expensive but the retry rate is the lowest of any model I test. The thinking variant is the right pick for agents that need to plan before acting.

Used in: The Vertical Agent Method, AI agent deployment guide. Cost: ~₹80 per typical agent run.

GPT-5.4 (OpenAI)

The fallback when Claude is rate-limited or the wrong style for the task.

Best at creative-style prompts where Claude is too literal. Strong on webdev-style tasks and ad-hoc code generation. The OpenAI function-calling API is the most documented, so it wins for client projects where the team has OpenAI muscle memory. Slightly higher retry rate than Claude on long multi-tool loops.

Used in: OpenAI function calling tutorial, CrewAI vs LangGraph. Cost: ~₹70 per typical agent run.

DeepSeek V3.2 (DeepSeek)

The cost-sensitive pick. Open weights, runs anywhere.

When the agent will run at high volume and the per-run cost matters more than absolute quality. Open weights mean I can self-host for client deployments that have data-residency requirements. Reasoning quality is close to GPT-5.4 on math, weaker on creative writing. Tool-use reliability is a step below Claude and GPT.

Used in: AI agent cost optimization. Cost: ~₹12 per typical agent run (API), or self-hosted.

Embeddings (RAG / vector search)

What I use to embed documents for retrieval-augmented agents and semantic search.

voyage-3 (Voyage AI)

Best retrieval quality per dollar for English+code.

Outperforms OpenAI's text-embedding-3-large on most retrieval benchmarks I've run, including on code snippets and technical docs. Specifically tuned for retrieval, not just generic embeddings. Cheap enough to re-embed on every schema change.

Used in: AI customer support agent. Cost: ~$0.06 per million tokens.

text-embedding-3-large (OpenAI)

The safe default. Available everywhere, predictable, well-documented.

Wins when the project needs "any embedding that works" and the team is already on OpenAI. Strong general-purpose quality. Use this when voyage isn't an option (e.g., the client has an Azure OpenAI contract and voyage isn't approved).

Cost: ~$0.13 per million tokens.
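For a sense of what re-embedding costs at the per-million-token prices quoted above, here is a small back-of-the-envelope sketch. The corpus size is a hypothetical figure, and the prices are the approximate ones from this page rather than current vendor pricing.

```python
# Approximate prices quoted on this page, in USD per million tokens.
PRICE_PER_MILLION_TOKENS_USD = {
    "voyage-3": 0.06,
    "text-embedding-3-large": 0.13,
}


def reembed_cost_usd(model: str, corpus_tokens: int) -> float:
    """Cost of embedding the whole corpus once with the given model."""
    return PRICE_PER_MILLION_TOKENS_USD[model] * corpus_tokens / 1_000_000


corpus_tokens = 50_000_000  # hypothetical corpus size

for model in PRICE_PER_MILLION_TOKENS_USD:
    print(f"{model}: ${reembed_cost_usd(model, corpus_tokens):.2f} per full re-embed")
```

At that hypothetical size a full re-embed costs a few dollars with either model, so the price gap matters mainly for very large corpora or very frequent re-embeds.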

Image models (when an agent needs to see)

For agents that take screenshots, read diagrams, or need vision understanding — not for generating images.

Claude Opus 4.7 vision (Anthropic)

Best at reading screenshots, diagrams, and UI mockups.

When the agent needs to look at a screenshot of a UI and reason about what's on screen, Claude is the most accurate. Used in agents that automate browser-based workflows or read PDF reports. Native to the same model I use for the agent loop, so no second vendor relationship.

Used in: browser-automation agents and document-parsing agents.

GPT-5.4 vision (OpenAI)

Stronger on OCR-heavy and table-extraction tasks.

When the agent's job is mostly reading dense text out of images (OCR, table extraction, form parsing), GPT-5.4 vision edges ahead. Weaker on UI/UX understanding than Claude.

What I don't pick (and why)

Categories I deliberately don't rank. Picking a model I don't ship with would be guessing, and my guess would be a worse version of an arena-style ranking.

Image generation (FLUX, DALL-E, Imagen, Midjourney). I don't ship image-generation agents. The arena.ai image leaderboard is a better signal than my guess would be.
Video generation (Sora, Veo, Runway). Same reason. Not my use case.
Audio / speech (Whisper, ElevenLabs, etc.). Whisper is the only one I've shipped with, and it's so dominant in its category that ranking alternatives would be theatre.
Open-weights reasoning models (Qwen, Kimi, Llama 4, etc.). I use them, but only for specific narrow tasks, mostly the cost-sensitive cases the #3 DeepSeek pick already covers. A full open-weights ranking is its own page.

External leaderboards I trust

For categories I don't cover, or for cross-checking my picks against broader signal, these are the leaderboards I actually open.

LMArena (arena.ai) — general chat quality across text, code, image, video. Best signal for "what does the average user prefer right now." Doesn't measure agent-loop reliability.
BridgeBench — AI coding and vibe-coding benchmark. 130+ real-world coding tasks across 6 categories. Best signal for "which model writes the best code" specifically.
Vellum LLM leaderboard — broader comparison across reasoning, coding, and price/performance tradeoffs. Updated frequently.

Want the full stack, not just the models? See what tools and infrastructure I use to ship agents.

→ Tools I use to build agents