My AI model picks
First-person picks from someone who ships AI agents for a living. These are the models I actually use, ranked by what I'd reach for first when starting a new project. Not a leaderboard mirror. Last updated .
Reading guide. #1 is the model I'd reach for first. #2 is the fallback when #1 is rate-limited, blocked, or stylistically wrong for the task. #3 is a specialist pick for a specific scenario (cost-sensitive, latency-sensitive, or a different "shape" of capability). The picks are for AI agent development specifically — tool use, long context, code generation, structured output, retry resilience. If you want a general-purpose chat ranking, the LMArena leaderboard and BridgeBench are better signals.
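The #1 / #2 ordering above can be wired directly into an agent as a ranked fallback chain. What follows is a minimal sketch, not how any particular SDK does it: `RateLimited`, `run_with_fallback`, and the two stand-in callables are hypothetical names made up for illustration, and in a real build each callable would wrap a provider client.

```python
from typing import Callable, Sequence


class RateLimited(Exception):
    """Hypothetical error a provider wrapper raises when it is throttled or blocked."""


def run_with_fallback(
    prompt: str,
    models: Sequence[tuple[str, Callable[[str], str]]],
) -> tuple[str, str]:
    """Try each (name, call) pair in ranked order; return (name, reply) from the first success."""
    errors = []
    for name, call in models:
        try:
            return name, call(prompt)
        except RateLimited as exc:
            # Record why this pick was skipped, then move on to the next one.
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all models failed: " + "; ".join(errors))


# Stand-in callables so the sketch runs without network access.
def primary(prompt: str) -> str:
    raise RateLimited("429 from provider")


def fallback(prompt: str) -> str:
    return f"fallback answered: {prompt}"


used, reply = run_with_fallback(
    "summarise the ticket",
    [("pick-1", primary), ("pick-2", fallback)],
)
print(used, "->", reply)
```

Only the rate-limit case is caught here. The "stylistically wrong for the task" case is a judgment call made when choosing the list order, not something this loop can detect.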
Text models (LLMs)
Claude Opus 4.7 (Anthropic)
My default for 80% of agent builds.
The most reliable tool-use model I've shipped with. Handles long agent loops (50+ steps) without drifting from the system prompt. Best at following complex multi-tool workflows. Expensive but the retry rate is the lowest of any model I test. The thinking variant is the right pick for agents that need to plan before acting.
GPT-5.4 (OpenAI)
The fallback when Claude is rate-limited or the wrong style for the task.
Best at creative-style prompts where Claude is too literal. Strong on webdev-style tasks and ad-hoc code generation. The OpenAI function-calling API is the most documented, so it wins for client projects where the team has OpenAI muscle memory. Slightly higher retry rate than Claude on long multi-tool loops.
DeepSeek V3.2 (DeepSeek)
The cost-sensitive pick. Open weights, runs anywhere.
When the agent will run at high volume and the per-run cost matters more than absolute quality. Open weights mean I can self-host for client deployments that have data-residency requirements. Reasoning quality is close to GPT-5.4 on math, weaker on creative writing. Tool-use reliability is a step below Claude and GPT.
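One way to reason about the cost-versus-retry tradeoff is to fold the retry rate into the per-run price. The sketch below assumes each attempt fails independently with probability `retry_rate` and is retried until it succeeds, so the expected number of attempts is 1 / (1 - retry_rate). The function name and every number in it are made-up placeholders, not measured prices or retry rates for any model on this page.

```python
def expected_cost_per_success(cost_per_attempt: float, retry_rate: float) -> float:
    """Expected spend to get one successful run, assuming independent retries."""
    if not 0.0 <= retry_rate < 1.0:
        raise ValueError("retry_rate must be in [0, 1)")
    # Geometric distribution: expected attempts until first success = 1 / (1 - p).
    return cost_per_attempt / (1.0 - retry_rate)


# Illustrative placeholders only; substitute your own measured values.
premium = expected_cost_per_success(cost_per_attempt=0.40, retry_rate=0.05)
budget = expected_cost_per_success(cost_per_attempt=0.05, retry_rate=0.30)
print(f"premium: ${premium:.3f} per successful run")
print(f"budget:  ${budget:.3f} per successful run")
```

The model ignores the cost of a failure that slips through undetected, which is often what matters most in practice. It only shows how a cheaper model's advantage shrinks as its retry rate rises.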
Embeddings (RAG / vector search)
voyage-3 (Voyage AI)
Best retrieval quality per dollar for English+code.
Outperforms OpenAI's text-embedding-3-large on most retrieval benchmarks I've run, including on code snippets and technical docs. Specifically tuned for retrieval, not just generic embeddings. Cheap enough to re-embed on every schema change.
text-embedding-3-large (OpenAI)
The safe default. Available everywhere, predictable, well-documented.
Wins when the project needs "any embedding that works" and the team is already on OpenAI. Strong general-purpose quality. Use this when Voyage isn't an option (e.g., the client has an Azure OpenAI contract and Voyage isn't an approved vendor).
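Comparing embedding models on your own data comes down to a small harness like the one below: embed the documents, embed each query, rank by cosine similarity, and count how often the known-relevant document lands in the top k. This is a sketch under stated assumptions, not the benchmark behind the picks above. `recall_at_k` and `toy_embed` are hypothetical names, and `toy_embed` is a keyword counter standing in for a real embedding client so the example runs offline.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recall_at_k(
    embed: Callable[[str], Vector],
    docs: dict[str, str],
    queries: list[tuple[str, str]],
    k: int = 3,
) -> float:
    """Fraction of (query, relevant_doc_id) pairs whose relevant doc ranks in the top k."""
    doc_vecs = {doc_id: embed(text) for doc_id, text in docs.items()}
    hits = 0
    for query, relevant_id in queries:
        q = embed(query)
        ranked = sorted(doc_vecs, key=lambda d: cosine(q, doc_vecs[d]), reverse=True)
        hits += relevant_id in ranked[:k]
    return hits / len(queries)


# Toy stand-in embedder (assumption): one dimension per keyword count.
KEYWORDS = ("retry", "vector", "invoice")


def toy_embed(text: str) -> list[float]:
    words = text.lower().split()
    return [float(words.count(w)) for w in KEYWORDS]


docs = {
    "a": "retry policy for the agent retry loop",
    "b": "vector index and vector search setup",
    "c": "invoice parsing notes",
}
queries = [("how do I configure retry", "a"), ("vector search", "b"), ("invoice", "c")]
score = recall_at_k(toy_embed, docs, queries, k=1)
print(f"recall@1 = {score:.2f}")
```

To compare two real models, pass each provider's embedding call as `embed` and run the same `docs` and `queries` through both.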
Image models (when an agent needs to see)
Claude Opus 4.7 vision (Anthropic)
Best at reading screenshots, diagrams, and UI mockups.
When the agent needs to look at a screenshot of a UI and reason about what's on screen, Claude is the most accurate. Used in agents that automate browser-based workflows or read PDF reports. Native to the same model I use for the agent loop, so no second vendor relationship.
GPT-5.4 vision (OpenAI)
Stronger on OCR-heavy and table-extraction tasks.
When the agent's job is mostly reading dense text out of images (OCR, table extraction, form parsing), GPT-5.4 vision edges ahead. Weaker on UI/UX understanding than Claude.
What I don't pick (and why)
- Image generation (FLUX, DALL-E, Imagen, Midjourney). I don't ship image-generation agents. The arena.ai image leaderboard is a better signal than my guess would be.
- Video generation (Sora, Veo, Runway). Same reason. Not my use case.
- Audio / speech (Whisper, ElevenLabs, etc.). Whisper is the only one I've shipped with, and it's so dominant in its category that ranking alternatives would be theatre.
- Open-weights reasoning models (Qwen, Kimi, Llama 4, etc.). I use them, but only for specific narrow tasks, mostly in the cost-sensitive role that the #3 DeepSeek pick already covers. A full open-weights ranking is its own page.
External leaderboards I trust
- LMArena (arena.ai) — general chat quality across text, code, image, video. Best signal for "what does the average user prefer right now." Doesn't measure agent-loop reliability.
- BridgeBench — AI coding and vibe-coding benchmark. 130+ real-world coding tasks across 6 categories. Best signal for "which model writes the best code" specifically.
- Vellum LLM leaderboard — broader comparison across reasoning, coding, and price/performance tradeoffs. Updated frequently.
Want the full stack, not just the models? See what tools and infrastructure I use to ship agents.
→ Tools I use to build agents