OpenAI function calling tutorial: building tools for GPT
A complete guide to OpenAI function calling — defining tools, handling parallel calls, streaming, and building a tool-using agent from scratch.
TL;DR: Function calling turns a chat model into something that can actually do things — query databases, call APIs, and compute results. This guide covers defining tools with JSON Schema, handling parallel calls, streaming with tool call deltas, and building a complete agent loop in 80 lines of Python with no frameworks.
Function calling is the single most important primitive in building AI agents. It’s what turns a chat model from a text generator into something that can actually do things — query databases, call APIs, send emails, compute results.
I’ve built agents using both OpenAI’s and Anthropic’s tool use APIs. Here’s my complete guide to OpenAI function calling, built from production experience rather than documentation examples.
Key takeaways:
- Function calling lets the model request structured function execution — it doesn’t execute functions itself, it asks you to do it
- Define tools as JSON Schema objects in the tools parameter alongside messages
- Parallel function calling means the model can request multiple tools in a single response; handle them all before returning results
- Streaming with function calls works by collecting partial tool_calls delta chunks by index
- A complete agent loop needs just OpenAI’s SDK; no frameworks required
OpenAI’s function calling documentation defines the standard for tool-use APIs — models that accept structured tool definitions and return callable function invocations. This is the most widely adopted tool-use format in the industry.
What function calling actually is
The name is misleading. OpenAI’s function calling doesn’t mean the model calls functions on your computer. The model outputs a structured request that says “I want to call this function with these arguments.” Your code decides whether to execute it.
The flow looks like this:
User: "What's the weather in Bengaluru?"
Model: "I should check the weather API."
↓
Model outputs: { "function": "get_weather", "args": { "location": "Bengaluru" } }
↓
Your code executes get_weather("Bengaluru") → "26°C, partly cloudy"
↓
You send the result back to the model
Model: "The weather in Bengaluru is 26°C and partly cloudy."
The model never touches your API keys, never executes code on your server. It just requests tool execution. You control what runs.
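Because the model only emits a request, the decision to run anything lives in your code. A minimal sketch of that gate, assuming a hypothetical `ALLOWED_TOOLS` registry and a stubbed `get_weather` (neither is part of the OpenAI SDK):

```python
import json


def get_weather(location: str) -> dict:
    # Stub standing in for a real weather lookup (hard-coded, hypothetical data).
    return {"location": location, "temp_c": 26, "conditions": "partly cloudy"}


# Hypothetical allowlist: only functions registered here may ever run.
ALLOWED_TOOLS = {"get_weather": get_weather}


def handle_tool_request(name: str, arguments_json: str) -> dict:
    """Decide whether to run a model-requested call, then run it."""
    func = ALLOWED_TOOLS.get(name)
    if func is None:
        # The model asked for something we never offered: refuse instead of executing.
        return {"error": f"Tool not allowed: {name}"}
    args = json.loads(arguments_json)  # arguments arrive as a JSON string
    return func(**args)


allowed = handle_tool_request("get_weather", '{"location": "Bengaluru"}')
refused = handle_tool_request("delete_database", "{}")
```

The refusal is returned as data rather than raised, so it can be sent back to the model like any other tool result.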
Defining tools
Tools are defined as JSON Schema objects. Each tool has a name, description, and parameters schema. The description is critical — it’s how the model knows when to call the tool.
import openai

# Responses API tool format: name, description, and parameters sit at the top level.
# (Chat Completions nests them under a "function" key instead.)
tools = [
    {
        "type": "function",
        "name": "get_weather",
        "description": "Get the current weather for a given location. Returns temperature, conditions, humidity, and wind speed.",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "City name, e.g. 'Bengaluru, India' or 'San Francisco, CA'"
                },
                "units": {
                    "type": "string",
                    "enum": ["celsius", "fahrenheit"],
                    "description": "Temperature units. Defaults to celsius for India, fahrenheit for US."
                }
            },
            "required": ["location"]
        }
    },
    {
        "type": "function",
        "name": "get_air_quality",
        "description": "Get air quality index and PM2.5 data for a location.",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string"}
            },
            "required": ["location"]
        }
    }
]
Rule of thumb for descriptions: Describe when to call the function, not just what it does. A function name like get_weather is obvious. The description should clarify edge cases:
- “Call when user asks about weather, temperature, or climate conditions”
- “Call for both current conditions and short-term forecasts”
- “Does NOT support historical weather data”
This prevents the model from calling the wrong tool or calling a tool for tasks it can’t handle.
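One way to keep descriptions in that shape is to assemble them from three parts: what the tool does, when to call it, and what it does not cover. The helper below is a hypothetical convenience, not anything the API requires; the resulting string simply goes into the tool's description field.

```python
def build_description(what: str, call_when: list[str], not_for: list[str]) -> str:
    """Join a one-line summary with explicit when-to-call and when-not-to-call hints."""
    parts = [what]
    if call_when:
        parts.append("Call when: " + "; ".join(call_when) + ".")
    if not_for:
        parts.append("Does NOT support: " + "; ".join(not_for) + ".")
    return " ".join(parts)


weather_description = build_description(
    "Get the current weather for a given location.",
    call_when=[
        "the user asks about weather, temperature, or climate conditions",
        "current conditions or short-term forecasts are needed",
    ],
    not_for=["historical weather data"],
)
```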
The basic function calling loop
Here’s a working agent loop from scratch — no frameworks, just OpenAI’s SDK:
import json
import openai

# get_weather and get_air_quality are your own implementations, defined elsewhere.
def agent_loop(user_input: str, tools: list, system_prompt: str = None):
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_input})

    while True:
        response = openai.responses.create(
            model="gpt-4o",
            input=messages,
            tools=tools,
            tool_choice="auto",
            parallel_tool_calls=False,  # this basic loop handles one call per turn
        )
        output = response.output

        # Check if the model wants to call a tool (it may not be the first output item)
        tool_call = next((item for item in output if item.type == "function_call"), None)
        if tool_call is None:
            # No tool calls: return the text response
            return response.output_text

        # Extract function name and arguments
        func_name = tool_call.name
        func_args = json.loads(tool_call.arguments)
        print(f" → Calling: {func_name}({func_args})")

        # Execute the function
        if func_name == "get_weather":
            result = get_weather(**func_args)
        elif func_name == "get_air_quality":
            result = get_air_quality(**func_args)
        else:
            result = {"error": f"Unknown function: {func_name}"}

        # Echo the model's output items back, then attach the result by call_id
        messages.extend(output)
        messages.append({
            "type": "function_call_output",
            "call_id": tool_call.call_id,
            "output": json.dumps(result),
        })
        # Loop again: the model will use the tool result
This is the core pattern. The loop:
- Sends messages to the model with available tools
- If the model requests a function call, executes it and sends the result back
- If the model returns text, we’re done
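I'm using the newer openai.responses.create() API here (the Responses API), which is cleaner for agent loops than the older Chat Completions API. If you're on Chat Completions (openai.chat.completions.create() in SDK v1 and later; openai.ChatCompletion.create() was removed in v1), the structure is similar, but tool definitions nest under a "function" key, the calls arrive in response.choices[0].message.tool_calls, and results go back as "tool" role messages keyed by tool_call_id.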
Parallel function calling
One of the biggest improvements in recent OpenAI models is parallel function calling — the model can request multiple function calls at once. This is critical for efficiency.
When a user asks “What’s the weather and air quality in Bengaluru?”, the model can call both get_weather and get_air_quality simultaneously instead of sequentially.
def agent_loop_parallel(user_input: str, tools: list):
    messages = [{"role": "user", "content": user_input}]

    while True:
        response = openai.responses.create(
            model="gpt-4o",
            input=messages,
            tools=tools,
            tool_choice="auto",
        )
        output = response.output

        # Collect all function calls
        function_calls = [item for item in output if item.type == "function_call"]
        if not function_calls:
            return response.output_text

        # Echo the model's output items (including every function call) back first
        messages.extend(output)

        # Execute ALL function calls (these could run in parallel)
        for fc in function_calls:
            func_name = fc.name
            func_args = json.loads(fc.arguments)
            print(f" → Calling: {func_name}({func_args})")

            if func_name == "get_weather":
                result = get_weather(**func_args)
            elif func_name == "get_air_quality":
                result = get_air_quality(**func_args)
            else:
                result = {"error": f"Unknown function: {func_name}"}

            # Attach each result to its call via call_id
            messages.append({
                "type": "function_call_output",
                "call_id": fc.call_id,
                "output": json.dumps(result),
            })
        # Every call now has a result, so loop back to the model
The key insight: execute all parallel calls before returning to the model. The model expects to receive all results together.
For performance, I run parallel calls with concurrent.futures.ThreadPoolExecutor:
import concurrent.futures

def execute_parallel_calls(function_calls):
    """Execute multiple function calls in parallel using threads."""
    # execute_function(fc) is assumed to dispatch on fc.name with the parsed fc.arguments
    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        future_to_call = {
            executor.submit(execute_function, fc): fc
            for fc in function_calls
        }
        results = []
        # as_completed yields in completion order, so keep each result paired with its call_id
        for future in concurrent.futures.as_completed(future_to_call):
            fc = future_to_call[future]
            try:
                result = future.result()
                results.append((fc.call_id, result))
            except Exception as e:
                results.append((fc.call_id, {"error": str(e)}))
    return results
Streaming with function calls
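Streaming complicates function calling because the arguments arrive as a series of partial-string delta events instead of one complete JSON object. With the Responses API, each argument delta carries an item_id and an output_index that identify which function call it belongs to; the function name and call_id arrive separately, on the event that announces the output item. (In Chat Completions, the equivalent is tool_calls deltas grouped by index.)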
def agent_loop_streaming(user_input: str, tools: list):
    messages = [{"role": "user", "content": user_input}]

    while True:
        stream = openai.responses.create(
            model="gpt-4o",
            input=messages,
            tools=tools,
            tool_choice="auto",
            stream=True,
        )

        # Collect streaming events
        text_content = ""
        calls = {}  # output_index -> {"call_id", "name", "arguments"}

        for event in stream:
            if event.type == "response.output_text.delta":
                text_content += event.delta
            elif event.type == "response.output_item.added":
                # The name and call_id arrive with the item, not with the argument deltas
                if event.item.type == "function_call":
                    calls[event.output_index] = {
                        "call_id": event.item.call_id,
                        "name": event.item.name,
                        "arguments": "",
                    }
            elif event.type == "response.function_call_arguments.delta":
                # Accumulate the partial JSON string
                # (event structure depends on SDK version; check your response schema)
                calls[event.output_index]["arguments"] += event.delta

        # After streaming completes, process tool calls
        if not calls:
            return text_content

        for idx in sorted(calls):
            call = calls[idx]
            func_args = json.loads(call["arguments"])
            if call["name"] == "get_weather":
                result = get_weather(**func_args)
            else:
                result = {"error": "Unknown function"}

            # Replay the function call item, then attach its output by call_id
            messages.append({
                "type": "function_call",
                "call_id": call["call_id"],
                "name": call["name"],
                "arguments": call["arguments"],
            })
            messages.append({
                "type": "function_call_output",
                "call_id": call["call_id"],
                "output": json.dumps(result),
            })
When streaming, always check the stream event type before accessing fields. Different SDK versions structure streaming events differently. I've been burnt by this twice — test your stream parsing against the actual SDK version you're using.
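One way to test stream parsing without hitting the API is to pull the accumulation step into a pure function and feed it hand-written events. The sketch below assumes the Responses API event names (`response.output_item.added`, `response.function_call_arguments.delta`) and fields (`output_index`, `item`, `delta`); the `SimpleNamespace` objects are fake stand-ins, so verify the shapes against the SDK version you actually run.

```python
import json
from types import SimpleNamespace as Event


def collect_function_calls(events) -> list[dict]:
    """Accumulate streamed function-call fragments into complete calls."""
    calls = {}  # output_index -> {"call_id", "name", "arguments"}
    for event in events:
        if event.type == "response.output_item.added" and event.item.type == "function_call":
            # The item announcement carries the call_id and function name.
            calls[event.output_index] = {
                "call_id": event.item.call_id,
                "name": event.item.name,
                "arguments": "",
            }
        elif event.type == "response.function_call_arguments.delta":
            # Each delta is a fragment of the JSON arguments string.
            calls[event.output_index]["arguments"] += event.delta
    return [calls[i] for i in sorted(calls)]


# Hand-written fake events imitating one streamed function call.
fake_stream = [
    Event(type="response.output_item.added", output_index=0,
          item=Event(type="function_call", call_id="call_1", name="get_weather")),
    Event(type="response.function_call_arguments.delta", output_index=0, delta='{"locat'),
    Event(type="response.function_call_arguments.delta", output_index=0, delta='ion": "Bengaluru"}'),
]

collected = collect_function_calls(fake_stream)
parsed_args = json.loads(collected[0]["arguments"])
```

The arguments string is only valid JSON once every fragment has arrived, which is why parsing happens after the loop rather than inside it.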
Error handling for function calls
Function calls fail. APIs return 500s. Network drops. Invalid arguments. Your agent needs to handle these gracefully.
def safe_execute_function(func_name: str, func_args: dict) -> dict:
    """Execute a function with error handling. Returns a result dict regardless of outcome."""
    try:
        if func_name == "get_weather":
            return get_weather(**func_args)
        elif func_name == "get_air_quality":
            return get_air_quality(**func_args)
        else:
            return {"error": f"Unknown function: {func_name}", "success": False}
    except KeyError as e:
        return {"error": f"Missing required parameter: {e}", "success": False}
    except TypeError as e:
        return {"error": f"Invalid arguments: {e}", "success": False, "args": func_args}
    except Exception as e:
        return {"error": f"Function execution failed: {str(e)}", "success": False}
When a function fails, return a structured error message to the model. The model can then:
- Explain the error to the user
- Try again with corrected arguments
- Try a different approach
Models handle errors surprisingly well if you return clear error messages. I’ve had the model suggest fixes for API credential issues based on the error text alone.
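To make the error text useful to the model, it helps to include the exception type and to cover the case where the arguments themselves are not valid JSON. A small sketch, using a hypothetical `flaky_weather` tool that always fails in order to simulate an upstream outage:

```python
import json


def flaky_weather(location: str) -> dict:
    # Hypothetical tool that always fails, standing in for an upstream outage.
    raise ConnectionError("weather API returned 503")


def run_tool(func, arguments_json: str) -> str:
    """Run a tool and always return a JSON string the model can read."""
    try:
        args = json.loads(arguments_json)
        result = {"success": True, "data": func(**args)}
    except json.JSONDecodeError as e:
        result = {"success": False, "error": f"Arguments were not valid JSON: {e}"}
    except TypeError as e:
        # Typically raised when required arguments are missing or unexpected ones are passed.
        result = {"success": False, "error": f"Invalid arguments: {e}"}
    except Exception as e:
        result = {"success": False, "error": f"{type(e).__name__}: {e}"}
    return json.dumps(result)


outage = json.loads(run_tool(flaky_weather, '{"location": "Bengaluru"}'))
bad_args = json.loads(run_tool(flaky_weather, '{"city": "Bengaluru"}'))
bad_json = json.loads(run_tool(flaky_weather, '{"location": '))
```

Each of the three failures produces a different message, which gives the model enough signal to choose between retrying, correcting its arguments, or telling the user the service is down.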
Comparison with Anthropic tool use
I build with both providers. Here’s how they compare for function calling:
| Aspect | OpenAI | Anthropic |
|---|---|---|
| Tool definition | JSON Schema in tools parameter | JSON Schema in tools parameter |
| Response format | function_call items in response.output (tool_calls array on the message in Chat Completions) | content blocks with tool_use type |
| Parallel calls | Native in one response | Native in one response |
| Streaming | Argument delta events per output item (tool_calls deltas with index in Chat Completions) | Content block deltas |
| Thinking before tools | No, calls directly | Optional thinking block before tool calls |
| Error recovery | Good with clear messages | Better — Claude is more cautious about retrying |
Anthropic’s key difference: Claude can optionally think before calling tools, which produces better results for complex multi-step reasoning. In my experience, OpenAI’s models call tools more eagerly, and sometimes prematurely.
I use OpenAI for simpler tool use (fetch data, compute results) and Anthropic when the agent needs to reason deeply before acting (multi-step analysis, research agents).
When function calling breaks
After months of production use, here’s what causes function calling to fail:
Ambiguous schemas. If two functions have overlapping descriptions (e.g., search_documents and search_web), the model gets confused about which to call. I’ve seen the model call search_documents when it should call search_web simply because the descriptions weren’t distinct enough.
Fix: Make descriptions mutually exclusive. “Use for searching the local document store” vs “Use for searching the internet.”
Contradictory instructions. If your system prompt says “Never make up information” but you also have a generate_report function that expects complete data, the model may refuse to call the function because it can’t satisfy both constraints.
Fix: Review your system prompt for conflicts with tool descriptions.
Missing parameters. The model sometimes omits optional parameters it should include. Marking the parameter as required in the JSON Schema forces the model to provide it, but increases the chance of hallucinated values.
Fix: Accept reasonable defaults in your function implementation instead of requiring the model to provide every parameter.
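As a sketch of that fix, the stub below treats `units` as optional and falls back to a default inside the function. The celsius default and the hard-coded temperature are assumptions made for illustration only.

```python
from typing import Optional


def get_weather(location: str, units: Optional[str] = None) -> dict:
    """Stub weather lookup that tolerates a missing or unrecognized `units`."""
    if units not in ("celsius", "fahrenheit"):
        units = "celsius"  # assumed default; pick whatever suits your users
    temp_c = 26.0  # hard-coded stand-in for a real API response
    temp = temp_c if units == "celsius" else temp_c * 9 / 5 + 32
    return {"location": location, "temperature": round(temp, 1), "units": units}


no_units = get_weather("Bengaluru, India")
with_units = get_weather("San Francisco, CA", units="fahrenheit")
```

The schema can then leave `units` out of `required`, so the model is never pushed to invent a value it does not know.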
Building a simple agent from scratch
Here’s the complete agent pattern I use for production. It’s about 80 lines of Python with no framework dependencies:
import json
import openai
from datetime import datetime


class FunctionCallingAgent:
    def __init__(self, tools: list, functions: dict, model="gpt-4o", max_steps=10):
        self.tools = tools
        self.functions = functions  # {"function_name": callable}
        self.model = model
        self.max_steps = max_steps
        self.steps = 0
        self.messages = []

    def run(self, user_input: str) -> str:
        self.steps = 0  # reset so the same agent can be reused across runs
        self.messages = [
            {"role": "system", "content": f"You are a helpful assistant. Today is {datetime.now().strftime('%Y-%m-%d')}. Use tools when needed."},
            {"role": "user", "content": user_input}
        ]

        while self.steps < self.max_steps:
            self.steps += 1
            response = openai.responses.create(
                model=self.model,
                input=self.messages,
                tools=self.tools,
                tool_choice="auto",
            )
            output = response.output
            function_calls = [o for o in output if o.type == "function_call"]
            if not function_calls:
                return response.output_text

            # Keep the model's function_call items in the history, ahead of their outputs
            self.messages.extend(output)

            # Execute all tool calls
            for fc in function_calls:
                func = self.functions.get(fc.name)
                if not func:
                    result = {"error": f"Unknown function: {fc.name}"}
                else:
                    try:
                        args = json.loads(fc.arguments)
                        result = func(**args)
                    except Exception as e:
                        result = {"error": str(e)}
                self.messages.append({
                    "type": "function_call_output",
                    "call_id": fc.call_id,
                    "output": json.dumps(result),
                })

        return "Agent stopped: max steps reached."


# Usage
tools = [...]  # Your tool definitions
functions = {
    "get_weather": get_weather,
    "get_air_quality": get_air_quality,
}
agent = FunctionCallingAgent(tools, functions)
result = agent.run("What's the weather and air quality in Bengaluru?")
That’s it. No LangChain. No LangGraph. One class, 80 lines, production-ready if you add logging and error handling on top.
Function calling is the foundation. Everything else — state machines, multi-agent orchestration, monitoring — is built on top of this pattern. Master this first, and you can build anything.