
My AI model picks

First-person picks from someone who ships AI agents for a living. These are the models I actually use, ranked by what I'd reach for first when starting a new project. Not a leaderboard mirror.

Reading guide. #1 is the model I'd reach for first. #2 is the fallback when #1 is rate-limited, blocked, or stylistically wrong for the task. #3 is a specialist pick for a specific scenario (cost-sensitive, latency-sensitive, or a different "shape" of capability). The picks are for AI agent development specifically: tool use, long context, code generation, structured output, and retry resilience. If you want a general-purpose chat ranking, the LMArena leaderboard and BridgeBench are better signals.
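The #1-then-#2 ordering maps naturally onto a fallback chain in code. Below is a minimal sketch of that idea, not how any particular SDK does it: `Pick`, `ModelUnavailable`, and `run_with_fallback` are hypothetical names, and the two stand-in functions replace real vendor calls so the example runs without network access.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


class ModelUnavailable(Exception):
    """Hypothetical error a client wrapper raises when a model is rate-limited or blocked."""


@dataclass
class Pick:
    name: str
    call: Callable[[str], str]  # hypothetical wrapper around a vendor SDK call


def run_with_fallback(picks: Sequence[Pick], prompt: str) -> tuple[str, str]:
    """Try each pick in rank order; return (model_name, output) from the first that succeeds."""
    errors = []
    for pick in picks:
        try:
            return pick.name, pick.call(prompt)
        except ModelUnavailable as exc:
            # Record why this pick was skipped, then move on to the next one.
            errors.append(f"{pick.name}: {exc}")
    raise RuntimeError("all picks failed: " + "; ".join(errors))


# Stand-in callables (assumptions) so the sketch is runnable offline.
def primary(prompt: str) -> str:
    raise ModelUnavailable("rate limited")


def fallback(prompt: str) -> str:
    return f"handled: {prompt}"


picks = [Pick("primary", primary), Pick("fallback", fallback)]
used, output = run_with_fallback(picks, "summarize ticket")
print(used, "->", output)
```

Only the "unavailable" case triggers the fallback here. Switching models because the style is wrong for the task is a design-time decision and is not something this sketch tries to detect.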

Text models (LLMs)

#1

Claude Opus 4.7 (Anthropic)

My default for 80% of agent builds.

The most reliable tool-use model I've shipped with. Handles long agent loops (50+ steps) without drifting from the system prompt. Best at following complex multi-tool workflows. Expensive, but its retry rate is the lowest of any model I test. The thinking variant is the right pick for agents that need to plan before acting.

Used in: The Vertical Agent Method, AI agent deployment guide. Cost: ~₹80 per typical agent run.

#2

GPT-5.4 (OpenAI)

The fallback when Claude is rate-limited or the wrong style for the task.

Best at creative-style prompts where Claude is too literal. Strong on webdev-style tasks and ad-hoc code generation. The OpenAI function-calling API is the most documented, so it wins for client projects where the team has OpenAI muscle memory. Slightly higher retry rate than Claude on long multi-tool loops.

Used in: OpenAI function calling tutorial, CrewAI vs LangGraph. Cost: ~₹70 per typical agent run.

#3

DeepSeek V3.2 (DeepSeek)

The cost-sensitive pick. Open weights, runs anywhere.

When the agent will run at high volume and the per-run cost matters more than absolute quality. Open weights mean I can self-host for client deployments that have data-residency requirements. Reasoning quality is close to GPT-5.4 on math, weaker on creative writing. Tool-use reliability is a step below Claude and GPT.

Used in: AI agent cost optimization. Cost: ~₹12 per typical agent run (API), or self-hosted.
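To see why per-run cost dominates at volume, here is a rough worked example using the three per-run figures quoted above. The volume of 10,000 runs per month is an assumption chosen for illustration, and the numbers ignore retries and self-hosting costs, which the figures above do not break out.

```python
# Approximate per-run costs in INR, taken from the picks above.
COST_PER_RUN_INR = {
    "Claude Opus 4.7": 80,
    "GPT-5.4": 70,
    "DeepSeek V3.2 (API)": 12,
}


def monthly_cost(runs_per_month: int) -> dict[str, int]:
    """Scale each per-run cost linearly by the number of runs."""
    return {model: cost * runs_per_month for model, cost in COST_PER_RUN_INR.items()}


RUNS_PER_MONTH = 10_000  # assumed volume, for illustration only
costs = monthly_cost(RUNS_PER_MONTH)
savings_vs_claude = costs["Claude Opus 4.7"] - costs["DeepSeek V3.2 (API)"]

for model, total in costs.items():
    print(f"{model}: ₹{total:,} per month")
print(f"Difference, Claude vs DeepSeek API: ₹{savings_vs_claude:,} per month")
```

At that assumed volume the gap between the #1 and #3 picks is ₹680,000 per month, which is the scale at which a step down in tool-use reliability can start to be worth trading for cost.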

Embeddings (RAG / vector search)

#1

voyage-3 (Voyage AI)

Best retrieval quality per dollar for English+code.

Outperforms OpenAI's text-embedding-3-large on most retrieval benchmarks I've run, including on code snippets and technical docs. Specifically tuned for retrieval, not just generic embeddings. Cheap enough to re-embed on every schema change.

Used in: AI customer support agent. Cost: ~$0.06 per million tokens.

#2

text-embedding-3-large (OpenAI)

The safe default. Available everywhere, predictable, well-documented.

Wins when the project needs "any embedding that works" and the team is already on OpenAI. Strong general-purpose quality. Use this when Voyage isn't an option (e.g., the client has an Azure OpenAI contract and Voyage isn't approved).

Cost: ~$0.13 per million tokens.
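The two per-million-token prices above make the "cheap enough to re-embed" claim easy to check. The sketch below assumes a 50-million-token corpus, a figure invented for illustration; the prices are the approximate ones quoted in this section.

```python
# Approximate prices in USD per million tokens, as quoted above.
PRICE_PER_MILLION_USD = {
    "voyage-3": 0.06,
    "text-embedding-3-large": 0.13,
}


def reembed_cost(corpus_tokens: int, price_per_million: float) -> float:
    """Cost in USD of embedding the whole corpus once, rounded to cents."""
    return round(corpus_tokens / 1_000_000 * price_per_million, 2)


CORPUS_TOKENS = 50_000_000  # assumed corpus size, for illustration only
reembed = {
    model: reembed_cost(CORPUS_TOKENS, price)
    for model, price in PRICE_PER_MILLION_USD.items()
}

for model, usd in reembed.items():
    print(f"{model}: ${usd:.2f} per full re-embed")
```

For a corpus of that assumed size, a full re-embed costs a few dollars with either model, so the choice between them is driven more by retrieval quality and vendor approval than by price.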

Image models (when an agent needs to see)

#1

Claude Opus 4.7 vision (Anthropic)

Best at reading screenshots, diagrams, and UI mockups.

When the agent needs to look at a screenshot of a UI and reason about what's on screen, Claude is the most accurate. Used in agents that automate browser-based workflows or read PDF reports. Native to the same model I use for the agent loop, so no second vendor relationship.

Used in: browser-automation agents and document-parsing agents.

#2

GPT-5.4 vision (OpenAI)

Stronger on OCR-heavy and table-extraction tasks.

When the agent's job is mostly reading dense text out of images (OCR, table extraction, form parsing), GPT-5.4 vision edges ahead. Weaker on UI/UX understanding than Claude.

What I don't pick (and why)

External leaderboards I trust
