# RLM: Code as Reasoning
RLM (Recursive Language Model) enables LLMs to solve complex tasks by writing and executing Python code in a sandboxed environment. Instead of processing everything in a single context window, the model can programmatically explore data, make recursive LLM calls, and iteratively refine its approach.
## Why RLM?
Traditional LLM calls struggle with:
- Long context: Performance degrades as context grows
- Complex reasoning: Multi-step logic is error-prone in natural language
- Data exploration: No way to peek, grep, or sample large datasets
RLM solves these by treating the prompt as a Python variable. The model can:
- Peek at data to understand structure
- Grep with regex to narrow search space
- Partition + map chunks with parallel `llm_batch()` for semantic tasks
- Summarize subsets for decision-making
- Process programmatically for deterministic operations
If the best frontier LLM can handle 10M tokens, an RLM can handle 100M tokens through recursive decomposition.
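The decomposition idea can be made concrete with a toy sketch in plain Python. Here `llm_query` is a deterministic stub standing in for the real function that the RLM sandbox provides; the chunking and recursion are the point, not the stub:

```python
# Toy sketch of recursive decomposition: split an oversized input into
# chunks that each fit a model's window, summarize each chunk, then
# recurse on the joined partial summaries until they fit one window.

def llm_query(prompt: str) -> str:
    # Stub: a real call would return model-generated text for `prompt`.
    return f"summary({len(prompt)} chars)"

def summarize(text: str, window: int = 1000) -> str:
    """Recursively summarize text that exceeds a model's window."""
    if len(text) <= window:
        return llm_query(f"Summarize: {text}")
    chunks = [text[i:i + window] for i in range(0, len(text), window)]
    partials = [summarize(c, window) for c in chunks]
    # The partial summaries may still be too long; recurse until they fit.
    return summarize(" ".join(partials), window)

print(summarize("x" * 2500))  # 2500 chars -> 3 chunks -> one combined pass
```

The same shape applies at any scale: a 100M-token input becomes a tree of window-sized subcalls.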
## When to Use RLM
Use RLM when your task requires multi-step reasoning across data, not just finding a single answer:
| Task Type | Best Tool |
|---|---|
| “Find this specific fact” | RAG / embeddings |
| “Answer questions about this document” | Long context |
| “Analyze patterns across many items” | RLM |
| “Aggregate and summarize a dataset” | RLM |
| “Find anomalies or outliers” | RLM |
## Use Cases
### For Developers & Individuals
| Use Case | Example Query | What RLM Does |
|---|---|---|
| Personal finance | “Categorize 2 years of transactions and find forgotten subscriptions” | Parse → categorize → aggregate → find anomalies |
| Codebase analysis | “Find API endpoints missing authentication” | Grep patterns → cross-reference files → trace calls |
| Research synthesis | “Compare methodologies across these 30 papers” | Partition → extract → compare → synthesize |
| Chat/email mining | “Find all commitments I made in Slack this quarter” | Grep promise patterns → filter dates → summarize |
| Log debugging | “Why did my app crash? Here’s 500K lines of logs” | Grep errors → trace backwards → root cause |
### For Enterprise
| Use Case | Example Query | What RLM Does |
|---|---|---|
| Contract intelligence | “Find contracts expiring in Q2 with auto-renewal clauses” | Parse legal text → extract dates/clauses → cross-reference |
| Support ticket analysis | “Top 5 feature requests hidden in 50K tickets” | Categorize → aggregate → rank → extract examples |
| Compliance auditing | “Check 1000 documents for PII that shouldn’t be there” | Systematic scan → flag issues → provide evidence |
| Incident investigation | “Correlate 5 services’ logs to find root cause” | Multi-source grep → timeline → causal analysis |
| Competitive intelligence | “Analyze 200 competitor press releases for trends” | Extract announcements → categorize → timeline |
### What Makes These RLM-Appropriate
These tasks share common traits:
- Volume: Too much data to review manually
- Reasoning: Answer requires connecting dots across items
- Iteration: Need to explore before knowing what to look for
- Aggregation: Final answer synthesizes many pieces
## Quick Start
```bash
curl -X POST https://api.modelrelay.ai/api/v1/rlm/execute \
  -H "Authorization: Bearer $MODELRELAY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-sonnet-4-5",
    "query": "Find all messages about project deadlines and summarize them",
    "context": {
      "messages": [
        {"from": "alice", "text": "The API deadline is next Friday"},
        {"from": "bob", "text": "Can we push the deadline?"},
        ...
      ]
    }
  }'
```
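For a Python client, the same request can be built with the standard library. A minimal sketch (the API key is a placeholder, and the actual send via `urllib.request.urlopen(req)` is left out so nothing hits the network here):

```python
import json
import urllib.request

API_URL = "https://api.modelrelay.ai/api/v1/rlm/execute"

def build_execute_request(api_key: str, model: str, query: str,
                          context: dict) -> urllib.request.Request:
    """Build the same POST request the curl example sends."""
    body = {"model": model, "query": query, "context": context}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_execute_request(
    "sk-example",  # placeholder key for illustration
    "claude-sonnet-4-5",
    "Find all messages about project deadlines and summarize them",
    {"messages": [{"from": "alice", "text": "The API deadline is next Friday"}]},
)
# urllib.request.urlopen(req) would perform the call; omitted here.
```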
Response:
```json
{
  "id": "5f3c0f6f-1a77-4d8f-9bd7-3a4f5a1f5e11",
  "output": [
    {
      "type": "message",
      "role": "assistant",
      "content": [{"type": "text", "text": "{\"deadline_messages\":[...],\"summary\":\"There are 5 messages about deadlines...\"}"}]
    }
  ],
  "stop_reason": "final",
  "model": "claude-sonnet-4-5",
  "answer": {
    "deadline_messages": [...],
    "summary": "There are 5 messages about deadlines..."
  },
  "iterations": 3,
  "subcalls": 2,
  "usage": {
    "input_tokens": 12500,
    "output_tokens": 2800,
    "total_tokens": 15300
  }
}
```
## Large Contexts (Upload Once, Reuse Many)
For large contexts (e.g., document collections, chat histories), upload once and reference by ID:
Why use context handles?
- Faster requests: Skip re-uploading megabytes of data on every call
- Lower bandwidth: Upload once, reference many times
- Multiple queries: Ask different questions about the same data
### Step 1: Upload Context
```bash
curl -X POST https://api.modelrelay.ai/api/v1/rlm/context \
  -H "Authorization: Bearer $MODELRELAY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "context": { "messages": [...] },
    "ttl_seconds": 3600
  }'
```
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `context` | any | Yes | — | JSON payload to store (max 10MB) |
| `ttl_seconds` | integer | No | 86400 (24h) | Time-to-live in seconds (max 30 days) |
Response:
```json
{
  "id": "c9bfa8e3-9c0e-4b5d-9d5d-1f3b3fb0b7c8",
  "expires_at": "2025-01-19T11:00:00.000Z",
  "size_bytes": 5242880
}
```
### Step 2: Execute with Context Reference
Use `context_ref` instead of `context`:
```bash
curl -X POST https://api.modelrelay.ai/api/v1/rlm/execute \
  -H "Authorization: Bearer $MODELRELAY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-sonnet-4-5",
    "query": "Summarize the key themes",
    "context_ref": "c9bfa8e3-9c0e-4b5d-9d5d-1f3b3fb0b7c8"
  }'
```
You can call `/rlm/execute` multiple times with the same `context_ref` until it expires:
```bash
# First query
curl ... -d '{"model": "...", "query": "Find action items", "context_ref": "c9bfa8e3-..."}'

# Second query (same context, different question)
curl ... -d '{"model": "...", "query": "Who mentioned deadlines?", "context_ref": "c9bfa8e3-..."}'

# Third query
curl ... -d '{"model": "...", "query": "Summarize by topic", "context_ref": "c9bfa8e3-..."}'
```
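In client code, the upload-once/reuse-many flow reduces to one upload body plus several execute bodies that share the handle ID. A minimal sketch of the payload construction only (the handle ID is the one from the example above; sending is omitted):

```python
import json

def upload_payload(context: dict, ttl_seconds: int = 3600) -> dict:
    """Body for POST /rlm/context."""
    return {"context": context, "ttl_seconds": ttl_seconds}

def execute_payload(model: str, query: str, context_ref: str) -> dict:
    """Body for POST /rlm/execute that references an uploaded context."""
    return {"model": model, "query": query, "context_ref": context_ref}

# One upload, three queries against the same handle.
handle_id = "c9bfa8e3-9c0e-4b5d-9d5d-1f3b3fb0b7c8"  # returned by the upload step
bodies = [execute_payload("claude-sonnet-4-5", q, handle_id)
          for q in ("Find action items",
                    "Who mentioned deadlines?",
                    "Summarize by topic")]
print(json.dumps(bodies[0]))
```

Note that each execute body carries `context_ref` and no inline `context`, per the either/or rule below.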
### Important Notes

- Immutable: Context handles cannot be updated. Upload a new one if your data changes.
- Expiration: Requests return `404` after `expires_at`. Upload again with a new TTL.
- Scoped: Customer-scoped tokens can only access their own context handles.
- Either/or: Provide `context` OR `context_ref`, never both.
## How It Works

1. You send a query/context or `/responses`-style `input` to `/rlm/execute`
2. The model receives a system prompt explaining the REPL environment
3. Each iteration, the model writes Python code
4. The sandbox executes the code and returns stdout/stderr
5. The model sees the output and continues (or calls `FINAL()` to finish)
```text
┌─────────────────────────────────────────────────────────┐
│                     Your Request                        │
│  query: "Find messages about deadlines"                 │
│  context: { messages: [...] }                           │
└─────────────────────────────────────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────────┐
│                      ModelRelay                         │
│  1. Start sandboxed Python REPL                         │
│  2. Inject context as variable                          │
│  3. Run iterative loop:                                 │
│     - LLM generates code                                │
│     - Sandbox executes code                             │
│     - LLM sees output, continues or finishes            │
└─────────────────────────────────────────────────────────┘
                          │
              ┌───────────┴───────────┐
              ▼                       ▼
      Python Sandbox          llm_query() callbacks
      (Firecracker VM)        (recursive LLM calls)
```
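The iterative loop can be sketched as a toy in plain Python. A canned list of code strings stands in for LLM code generation, and `exec` with captured stdout stands in for the isolated sandbox (the real one runs in a Firecracker VM):

```python
import contextlib
import io

def run_rlm(model_steps, context):
    """Toy loop: the 'model' emits code, the 'sandbox' executes it,
    and the transcript would feed the model's next step."""
    final = {}
    env = {
        "context": context,
        "FINAL": lambda value: final.setdefault("answer", value),
    }
    transcript = []
    for code in model_steps:        # stands in for LLM code generation
        out = io.StringIO()
        with contextlib.redirect_stdout(out):
            exec(code, env)         # stands in for sandboxed execution
        transcript.append({"code": code, "stdout": out.getvalue()})
        if "answer" in final:       # the model called FINAL()
            break
    return final.get("answer"), transcript

answer, traj = run_rlm(
    ["print(len(context['messages']))",
     "FINAL({'count': len(context['messages'])})"],
    {"messages": [{"text": "hi"}, {"text": "deadline is Friday"}]},
)
```

The `trajectory` field in the real response records exactly this kind of per-iteration code/stdout history.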
## Available Functions

Inside the sandbox, the model has access to:

| Function | Description |
|---|---|
| `context` | The input data you provided |
| `llm_query(prompt)` | Make a single LLM call (for summarization, classification, etc.) |
| `llm_batch(prompts)` | Make parallel LLM calls (for processing multiple chunks) |
| `FINAL(value)` | Return the answer and stop execution |
### Example: Semantic Search

```python
# Model-generated code (you don't write this)

# 1. Peek at the data structure
print(f"Total messages: {len(context['messages'])}")
print(f"Sample: {context['messages'][0]}")

# 2. Find relevant messages with regex
import re
deadline_msgs = [m for m in context['messages']
                 if re.search(r'deadline|due|by friday', m['text'], re.I)]

# 3. Use LLM to summarize findings
if deadline_msgs:
    summary = llm_query(f"Summarize these deadline-related messages: {deadline_msgs}")
    FINAL({"messages": deadline_msgs, "summary": summary})
else:
    FINAL({"messages": [], "summary": "No deadline messages found"})
```
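The partition + map pattern looks similar, with `llm_batch` fanning chunks out in parallel. A sketch with `llm_batch` stubbed for illustration (inside the sandbox the real function is provided):

```python
# Sketch of partition + map: chunk the data, send one prompt per chunk
# through llm_batch (stubbed here), then reduce the per-chunk results.

def llm_batch(prompts):
    # Stub: the real function runs these as parallel LLM calls.
    return [f"themes for {p[:20]}..." for p in prompts]

def chunk(items, size):
    return [items[i:i + size] for i in range(0, len(items), size)]

messages = [{"text": f"message {i}"} for i in range(250)]
chunks = chunk(messages, 100)  # 3 chunks of at most 100 messages

# One prompt per chunk, executed in parallel by llm_batch.
prompts = [f"List the themes in: {[m['text'] for m in c]}" for c in chunks]
per_chunk = llm_batch(prompts)

# A final llm_query-style reduce step would merge per_chunk and
# pass the result to FINAL(); omitted in this sketch.
print(len(per_chunk))
```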
## Request Parameters

| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `model` | string | Yes | — | Model for code generation and subcalls |
| `query` | string | No* | — | The task for the LLM to accomplish |
| `input` | array | No* | — | `/responses`-style input items (messages, tool results) |
| `context` | any | No | `null` | Data available as the `context` variable |
| `context_ref` | string | No | — | Context handle ID (UUID). Provide instead of `context`. |
| `max_iterations` | integer | No | 10 | Max code generation cycles |
| `max_subcalls` | integer | No | 50 | Max `llm_query`/`llm_batch` calls |
| `max_depth` | integer | No | 1 | Max recursion depth for nested RLM |
| `timeout_ms` | integer | No | 60000 | Timeout per code execution (ms) |

\*Provide either `query` or `input`. If both are present, the request is rejected.

Note: `input` must be text-only. Non-text content parts (files/images/audio) are rejected.
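These either/or rules can be mirrored client-side before sending. A small sketch (`validate_execute_body` is a hypothetical helper, not part of the API; the server enforces the same constraints with `400` responses):

```python
def validate_execute_body(body: dict) -> None:
    """Reject bodies that violate the documented either/or rules."""
    if ("query" in body) == ("input" in body):
        raise ValueError("provide exactly one of 'query' or 'input'")
    if "context" in body and "context_ref" in body:
        raise ValueError("provide 'context' OR 'context_ref', never both")

# A valid body passes silently:
validate_execute_body({"model": "claude-sonnet-4-5",
                       "query": "Summarize the key themes",
                       "context_ref": "c9bfa8e3-..."})
```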
## Response
```json
{
  "model": "claude-sonnet-4-5",
  "answer": { ... },
  "iterations": 4,
  "subcalls": 12,
  "usage": {
    "input_tokens": 15000,
    "output_tokens": 3000,
    "total_tokens": 18000
  },
  "trajectory": [
    {
      "iteration": 1,
      "code": "print(len(context['messages']))",
      "stdout": "1000\n",
      "stderr": "",
      "exit_code": 0
    },
    ...
  ]
}
```
| Field | Description |
|---|---|
| `id` | Session-scoped response ID for the RLM execution |
| `output` | `/responses`-style output (assistant text). JSON answers are encoded as text. |
| `stop_reason` | `final` (RLM completed) |
| `answer` | The value passed to `FINAL()` |
| `iterations` | Number of code generation cycles used |
| `subcalls` | Total `llm_query` + `llm_batch` calls made |
| `usage` | Aggregated token usage across all LLM calls |
| `trajectory` | Full execution history (code + output per iteration) |
| `progress` | Progress events emitted during execution |
## Streaming

Streaming for `/rlm/execute` is not supported yet. Use the `progress` array in the response to inspect per-iteration status, and subscribe to workflow run events if you are executing RLM via `llm.rlm` nodes in a workflow.
## Limits and Errors
| Condition | HTTP Status | Description |
|---|---|---|
| Invalid input | 400 | Missing required fields, negative limits |
| Max iterations exceeded | 409 | Model didn’t call `FINAL()` in time |
| Max subcalls exceeded | 409 | Too many `llm_query`/`llm_batch` calls |
| Max depth exceeded | 409 | Recursive RLM calls too deep |
| Execution timeout | — | Sandbox restarts, model can recover |
## Best Practices

- Start simple: Let the model explore the data structure first
- Use `llm_batch` for parallelism: Processing chunks in parallel is faster and often cheaper
- Set appropriate limits: Higher `max_iterations` for complex tasks, lower for simple ones
- Monitor usage: RLM can make many subcalls; watch your token consumption
## Compared to Workflows

| Feature | `/rlm/execute` | Workflows |
|---|---|---|
| Complexity | Single call | Multi-node DAG |
| Use case | Exploratory, data-intensive | Structured pipelines |
| State | Ephemeral | Persistent runs |
| Flexibility | Model decides approach | You define the flow |
Use `/rlm/execute` when:
- You don’t know the exact steps needed
- The task requires exploring or searching data
- You want the model to decide how to decompose the problem
Use workflows when:
- You have a known sequence of steps
- You need persistent audit trails
- You want explicit control over the execution flow