Skip to content

Seer Documentation

Overview

Seer is a framework for having agents conduct interpretability work and investigations. The core mechanism involves launching a remote sandbox hosted on a remote GPU or CPU. The agent operates an IPython kernel and notebook on this remote host.

Why use it?

This approach is valuable because it allows you to see what the agent is doing as it runs, and it can iteratively add things, fix bugs, and adjust its previous work. You can provide tooling to make an environment and any interpretability techniques available as function calls that the agent can use in the notebook as part of writing normal code.

When to use Seer

  • Exploratory investigations where you have a hypothesis but want to try many variations quickly
  • Scaling up measuring how well different interp techniques perform through giving agents controlled access to them
  • Replicating known experiments on new models — the agent knows the recipe, you just point it at your model
  • Building and improving existing agents Using seer to build better investigative agents, building better auditing agents etc.

Example runs

Quick Start

Prerequisites

  • Modal account (GPU infrastructure)
  • uv package manager

Setup

git clone https://github.com/ajobi-uhc/seer
cd seer
uv sync
uv run modal token new

Create .env:

ANTHROPIC_API_KEY=sk-ant-...
HF_TOKEN=hf_...  # Optional, for gated models

Run an experiment

cd experiments/hidden-preference-investigation
uv run python main.py

What happens:

  1. Modal provisions GPU (~30 sec)
  2. Downloads models (cached for future runs)
  3. Agent runs the experiment in a notebook
  4. Results saved to ./outputs/

Costs: A100 ~$1-2/hour. Typical experiments 10-60 minutes.

Design Philosophy

Seer tries not to be opinionated and is built to be hackable. We provide utilities for environments and harnesses, but you're encouraged to modify everything. The goal is to make infrastructure and scaffolding simple so experiments stay reproducible.


Core Concepts

┌──────────────────────────────────────────────────────────────┐
│                         Your Machine                         │
│  ┌─────────────────────────────────────────────────────────┐ │
│  │                        Harness                          │ │
│  │  run_agent(prompt, mcp_config, provider="claude")       │ │
│  └───────────────────────────┬─────────────────────────────┘ │
│                              │ MCP                           │
│                              ▼                               │
│  ┌─────────────────────────────────────────────────────────┐ │
│  │                        Session                          │ │
│  │  Notebook: agent works in Jupyter                       │ │
│  │  Local: agent runs locally, calls GPU via RPC           │ │
│  └───────────────────────────┬─────────────────────────────┘ │
└──────────────────────────────┼───────────────────────────────┘
┌──────────────────────────────────────────────────────────────┐
│                      Modal (Remote GPU)                      │
│  ┌─────────────────────────────────────────────────────────┐ │
│  │                       Sandbox                           │ │
│  │  - GPU (A100, H100, etc.)                               │ │
│  │  - Models (cached on Modal volumes)                     │ │
│  │  - Workspace libraries                                  │ │
│  └─────────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────┘

Sandbox

GPU environment with models loaded.

sandbox = Sandbox(SandboxConfig(
    gpu="A100",
    models=[ModelConfig(name="google/gemma-2-9b")],
)).start()

Two types: - Sandbox — agent has full access - ScopedSandbox — agent can only call functions you expose

Workspace

Any files/libraries the agent should have in its workspace.

workspace = Workspace(libraries=[
    Library.from_file("steering_hook.py"),
])

Session

How the agent connects to the sandbox.

session = create_notebook_session(sandbox, workspace)  # Access via notebook
# or
session = create_cli_session(workspace, workspace_dir)  # Access via the cli

Harness

Runs the agent.

async for msg in run_agent(prompt, mcp_config=session.mcp_config):
    pass

Putting it together

# 1. Sandbox
config = SandboxConfig(gpu="A100", models=[...])
sandbox = Sandbox(config).start()

# 2. Workspace
workspace = Workspace(libraries=[...])

# 3. Session
session = create_notebook_session(sandbox, workspace)

# 4. Harness
async for msg in run_agent(prompt, mcp_config=session.mcp_config):
    pass

# 5. Cleanup
sandbox.terminate()

Environment

An environment is everything your agent needs to do its work: GPU compute, models, packages, files, and tools. Seer environments run on Modal, so you get on-demand GPUs without managing infrastructure.

You define what you need declaratively. Seer handles provisioning, model downloads, and caching.

Sandbox

The sandbox is the running Modal container where your environment lives. Your agent runs locally and connects to the sandbox to execute code.

config = SandboxConfig(
    gpu="A100",
    models=[ModelConfig(name="google/gemma-2-9b")],
    python_packages=["torch", "transformers"],
)

sandbox = Sandbox(config).start()
# ... agent works ...
sandbox.terminate()

Config options

Field What it does
gpu GPU type: "A100", "H100", or None for CPU
gpu_count Number of GPUs (default: 1)
models HuggingFace models to download
python_packages pip packages to install
system_packages apt packages to install
secrets Env vars to pass from local .env
timeout Sandbox timeout in seconds (default: 3600)
local_files Files to mount: [("./local.txt", "/sandbox/path.txt")]
local_dirs Directories to mount: [("./data", "/workspace/data")]
debug Enable VS Code in browser

Models

Models are downloaded to Modal volumes and cached across runs:

models=[
    ModelConfig(name="google/gemma-2-9b"),
    ModelConfig(name="my-org/my-adapter", is_peft=True, base_model="meta-llama/Llama-2-7b"),
]
ModelConfig field What it does
name HuggingFace model ID
var_name Variable name in model info (default: "model")
hidden Hide model details from agent
is_peft Model is a PEFT/LoRA adapter
base_model Base model ID (required if is_peft=True)

Repos

Clone git repos into the sandbox:

repos=[
    RepoConfig(url="https://github.com/org/repo"),
    RepoConfig(url="org/repo", install="pip install -e ."),
]

Working with a running sandbox

Write files:

sandbox.write_file("/workspace/config.json", '{"key": "value"}')
sandbox.ensure_dir("/workspace/outputs")

Run commands:

sandbox.exec("pip install einops")
sandbox.exec_python("print(torch.cuda.is_available())")

Snapshots

Save sandbox state and restore it later:

snapshot = sandbox.snapshot("after setup")

# Later...
new_sandbox = Sandbox.from_snapshot(snapshot, config)

Useful for checkpointing long experiments or sharing reproducible starting points.

Sandbox vs ScopedSandbox

Sandbox — agent has full notebook access, can run arbitrary code

ScopedSandbox — agent can only call functions you expose via an interface file

# Full access
sandbox = Sandbox(config).start()
session = create_notebook_session(sandbox, workspace)

# Scoped access
scoped = ScopedSandbox(config).start()
model_tools = scoped.serve("interface.py", expose_as="library")
session = create_local_session(workspace, workspace_dir)

Properties

Property What it returns
sandbox.jupyter_url Jupyter URL (notebook mode)
sandbox.code_server_url VS Code URL (debug mode)
sandbox.model_handles Prepared model handles
sandbox.sandbox_id Modal sandbox ID

Scoped Sandbox & RPC

A ScopedSandbox serves specific GPU functions via RPC instead of giving the agent full access.

When to use

  • Sandbox — agent has full notebook access, good for exploration
  • ScopedSandbox — agent can only call functions you expose, good for controlled experiments

Writing interface files

An interface file defines what GPU functions the agent can call.

# interface.py
from transformers import AutoModel, AutoTokenizer
import torch

model_path = get_model_path("google/gemma-2-9b")  # injected
model = AutoModel.from_pretrained(model_path, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_path)

@expose
def get_embedding(text: str) -> dict:
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
    embedding = outputs.hidden_states[-1].mean(dim=1).squeeze()
    return {"embedding": embedding.tolist()}

Rules: - @expose marks functions the agent can call - Must return JSON-serializable types (use .tolist() for tensors) - get_model_path() is injected — returns cached model path - Load models at module level, not inside functions

Serving the interface

scoped = ScopedSandbox(SandboxConfig(
    gpu="A100",
    models=[ModelConfig(name="google/gemma-2-9b")],
)).start()

model_tools = scoped.serve(
    "interface.py",
    expose_as="library",  # or "mcp"
    name="model_tools"
)

expose_as options: - "library" — agent imports it: import model_tools - "mcp" — agent sees functions as MCP tools

Using with local session

workspace = Workspace(libraries=[model_tools])
session = create_local_session(workspace, workspace_dir)

async for msg in run_agent(prompt, mcp_config={}):
    pass

The agent runs locally. When it calls model_tools.*, the call goes to the GPU via RPC.


Sessions

Sessions define how the agent connects to the sandbox.

Sandbox type Session type Agent experience
Sandbox Notebook Full Jupyter access on GPU
ScopedSandbox Local Runs locally, calls exposed functions via RPC

Notebook session

Agent gets a Jupyter notebook running on the sandbox.

session = create_notebook_session(sandbox, workspace)

Returns: - session.mcp_config — pass to run_agent - session.jupyter_url — view notebook in browser - session.model_info_text — model details for agent prompt

Use when: exploratory research, iterative probing, visualization.

Local session

Agent runs on your machine. GPU access is through the functions you exposed.

session = create_local_session(workspace, workspace_dir, name)

Returns the same mcp_config interface, but execution happens locally.

Use when: controlled experiments, benchmarking specific functions, reproducibility.

Requires ScopedSandbox with interface file.


Harness

The harness runs the agent and connects it to a session. Seer provides a default harness, but it's designed to be swapped out. The session provides an mcp config for any harness/agent to connect to.

Basic usage

async for msg in run_agent(
    prompt=task,
    mcp_config=session.mcp_config,
):
    print(msg)

The harness: 1. Connects the agent to the session via MCP 2. Sends the prompt 3. Streams messages back 4. Handles tool calls automatically

Providers

provider="claude"   # Claude (default)

Interactive mode

Chat with the agent in your terminal. Press ESC to interrupt mid-response.

await run_agent_interactive(
    prompt=prompt,
    mcp_config=session.mcp_config,
    user_message="Start by exploring the model's hidden preferences.",
)

Multi-agent

For multi-agent setups, run multiple agents with different (or the same!) configs:

auditor = run_agent(auditor_prompt, mcp_config=auditor_tools)
investigator = run_agent(investigator_prompt, mcp_config=investigator_tools)
judge = run_agent(judge_prompt, mcp_config={})

Custom harnesses

The harness is just scaffolding around the agent. You can:

  • Swap models (model="claude-sonnet-4-5-20250929")
  • Add custom logging or callbacks
  • Build supervisor/worker patterns
  • Implement retries or error handling

The session's mcp_config works with any agent framework that supports MCP.


Workspaces

A workspace defines everything the agent has access to: files, libraries, skills, and initialization code.

workspace = Workspace(
    local_dirs=[("./data", "/workspace/data")],
    libraries=[Library.from_file("helpers.py")],
    skill_dirs=["./skills/research"],
    custom_init_code="model = load_my_model()",
)

What you can configure

Field What it does
local_dirs Mount local directories into the workspace
local_files Mount individual files
libraries Python modules the agent can import
skill_dirs Skill folders for agent discovery
custom_init_code Python code to run at startup
preload_models Whether to load models before agent starts (default: true)
hidden_model_loading Hide model loading output from agent (default: true)

Libraries

Make Python files importable by the agent:

workspace = Workspace(libraries=[
    Library.from_file("utils.py"),
    Library.from_skill_dir("skills/steering"),
])

When using ScopedSandbox, RPC handles are also libraries:

model_tools = scoped.serve("interface.py", expose_as="library")
workspace = Workspace(libraries=[model_tools])

Either way, the agent just imports:

from utils import my_helper
import model_tools

Skills

Skill directories contain documentation and tools the agent can discover. Useful for giving the agent reference material or predefined procedures.

workspace = Workspace(skill_dirs=["./skills/activation_patching"])

Custom init code

Run arbitrary Python before the agent starts:

workspace = Workspace(
    custom_init_code="""
from transformers import AutoModel
model = AutoModel.from_pretrained("google/gemma-2-9b")
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-9b")
"""
)

Variables defined here are available in the agent's namespace.

Seer toolkit

Common interpretability utilities live in experiments/toolkit/:

  • extract_activations.py — layer activation extraction
  • steering_hook.py — activation steering via hooks
  • generate_response.py — text generation helper
toolkit = Path("experiments/toolkit")
workspace = Workspace(libraries=[
    Library.from_file(toolkit / "steering_hook.py"),
    Library.from_file(toolkit / "extract_activations.py"),
])

These are meant to be copied and modified.


Experiments

Experiment 0: Local Mode (No Modal)

Run experiments locally without Modal signup or GPU. This will restrict you to mostly black box investigations.

When to use local mode

Local mode is for experiments that don't need GPU:

  • API-based investigations - Probe models via OpenRouter, OpenAI, Anthropic APIs
  • Testing and development - Iterate on prompts/tools before running on GPU
  • CPU-only analysis - Data processing, visualization, lightweight inference

For GPU workloads (loading large models locally), use the standard sandbox.

Prerequisites

  • Repo cloned and uv sync completed
  • ANTHROPIC_API_KEY in your .env file (for the agent)
  • Any other API keys your experiment needs (e.g., OPENROUTER_API_KEY)

Quick start

cd experiments/api-kimi-investigation
export OPENROUTER_API_KEY=your_key
uv run python main_local.py

That's it. No Modal signup, no GPU provisioning.

How it works

Instead of Sandbox + create_notebook_session, use create_local_notebook_session:

from src.execution import create_local_notebook_session
from src.workspace import Workspace, Library

# Create local session (starts Jupyter locally)
session = create_local_notebook_session(
    workspace=Workspace(libraries=[Library.from_file("my_tools.py")]),
    name="my-experiment",
)

# Same interface as remote sessions
print(session.mcp_config)  # For agent connection
session.exec("print('Hello!')")  # Execute code
session.terminate()  # Cleanup

Full example: Kimi investigation

This experiment uses Claude to investigate another model's (Kimi) behavior via API:

# experiments/api-kimi-investigation/main_local.py
import asyncio
from pathlib import Path

from src.workspace import Workspace, Library
from src.execution import create_local_notebook_session
from src.harness import run_agent


async def main():
    example_dir = Path(__file__).parent

    # Workspace with OpenRouter client library
    workspace = Workspace(
        libraries=[Library.from_file(example_dir / "openrouter_client.py")]
    )

    # Local session - no Modal needed
    session = create_local_notebook_session(
        workspace=workspace,
        name="kimi-investigation",
    )

    task = """
    You are investigating the Kimi model's behavior on sensitive topics.
    Use model "moonshotai/kimi-k2-0905" via openrouter_client.client.

    Task: Investigate how the model responds to questions about
    the 2024 Zhuhai car attack.
    """

    try:
        async for msg in run_agent(
            prompt=task,
            mcp_config=session.mcp_config,
            provider="claude",
        ):
            pass
    finally:
        session.terminate()


if __name__ == "__main__":
    asyncio.run(main())

The helper library (openrouter_client.py):

import os
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ.get("OPENROUTER_API_KEY"),
)

What's different from remote mode

Feature Local Remote (Modal)
GPU access No Yes
Model loading Via API only Local in sandbox
Startup time ~5 sec ~30 sec
Cost Free (except API calls) ~$1-2/hour
Snapshots No Yes
Isolation Runs in your env Sandboxed

API compatibility

LocalNotebookSession has the same interface as NotebookSession:

  • session.exec(code) - Execute Python code
  • session.mcp_config - MCP config for agents
  • session.workspace_path - Where libraries are installed
  • session.terminate() - Cleanup

So you can often switch between local and remote by just changing the session creation.


Experiment 1: Sandbox Intro

Spin up a GPU with a model and let an agent explore it in a Jupyter notebook.

1. Configure the sandbox

from src.environment import Sandbox, SandboxConfig, ExecutionMode, ModelConfig

config = SandboxConfig(
    gpu="A100",
    execution_mode=ExecutionMode.NOTEBOOK,
    models=[ModelConfig(name="google/gemma-2-2b-it")],
    python_packages=["torch", "transformers", "accelerate"],
)
  • gpu — A100 has 40GB VRAM, fits models up to ~30B params
  • execution_mode — NOTEBOOK means agent works in Jupyter on the GPU
  • models — HuggingFace model IDs to download and load
  • python_packages — installed in the sandbox

2. Start the sandbox

sandbox = Sandbox(config).start()

Provisions the GPU on Modal. First run downloads the model (~2 min), subsequent runs use cache.

3. Create a workspace

from src.workspace import Workspace

workspace = Workspace(libraries=[])

Workspace defines custom code the agent can import. Empty for now — later examples add interpretability tools here.

4. Create a session

from src.execution import create_notebook_session

session = create_notebook_session(sandbox, workspace)

Returns: - session.mcp_config — config for agent to connect to the notebook - session.jupyter_url — open this to watch the agent work - session.model_info_text — model details to include in agent prompt

5. Run the agent

from src.harness import run_agent

task = (example_dir / "task.md").read_text()
prompt = f"{session.model_info_text}\n\n{task}"

async for msg in run_agent(
    prompt=prompt,
    mcp_config=session.mcp_config,
    provider="claude"
):
    pass

sandbox.terminate()

The notebook saves to ./outputs/ as the agent works.

Full example

cd experiments/sandbox-intro && python main.py

Experiment 2: Scoped Sandbox

Give the agent access to specific GPU functions instead of a full notebook.

When to use this

  • Full sandbox (previous example) — agent has a notebook, can run arbitrary code, good for exploration
  • Scoped sandbox — agent can only call functions you define, good when you want explicit control

1. Configure the scoped sandbox

from src.environment import ScopedSandbox, SandboxConfig, ModelConfig

scoped = ScopedSandbox(SandboxConfig(
    gpu="A100",
    models=[ModelConfig(name="google/gemma-2-9b")],
    python_packages=["torch", "transformers", "accelerate"],
))

scoped.start()

No execution_mode — the agent doesn't run in the sandbox. Instead, you serve specific functions from it.

2. Define GPU functions

Create an interface file with functions that run on the GPU:

# interface.py
from transformers import AutoModel, AutoTokenizer
import torch

model_path = get_model_path("google/gemma-2-9b")  # injected by RPC server
model = AutoModel.from_pretrained(model_path, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_path)

@expose
def get_model_info() -> dict:
    """Get basic model information."""
    return {
        "num_layers": model.config.num_hidden_layers,
        "hidden_size": model.config.hidden_size,
        "vocab_size": model.config.vocab_size,
    }

@expose
def get_embedding(text: str) -> dict:
    """Get text embedding from model."""
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
    embedding = outputs.hidden_states[-1].mean(dim=1).squeeze()
    return {"embedding": embedding.tolist()}
  • @expose marks functions the agent can call — everything else is hidden
  • Functions must return JSON-serializable types (use .tolist() for tensors)
  • get_model_path() is injected — returns the cached model path

3. Serve the interface

model_tools = scoped.serve(
    str(example_dir / "interface.py"),
    expose_as="library",
    name="model_tools"
)

Loads interface.py on the GPU and creates an RPC server.

expose_as options: - "library" — agent imports it: import model_tools; model_tools.get_embedding("hello") - "mcp" — agent sees them as MCP tools

Full example

cd experiments/scoped-sandbox-intro && python main.py

Experiment 3: Hidden Preference Investigation

Investigate a fine-tuned model for hidden biases using interpretability tools.

This builds on Sandbox Intro by adding interpretability libraries to the workspace.

1. Configure with PEFT model

from src.environment import Sandbox, SandboxConfig, ExecutionMode, ModelConfig

config = SandboxConfig(
    gpu="A100",
    execution_mode=ExecutionMode.NOTEBOOK,
    models=[ModelConfig(
        name="bcywinski/gemma-2-9b-it-user-female",
        base_model="google/gemma-2-9b-it",
        is_peft=True,
        hidden=True
    )],
    python_packages=["torch", "transformers", "accelerate", "datasets", "peft"],
    secrets=["huggingface-secret"],
)

New ModelConfig parameters: - base_model — base model to load first - is_peft=True — this is a PEFT adapter (LoRA, etc.), not a full model - hidden=True — hides model name from agent to prevent bias in investigation

2. Add interpretability libraries

from src.workspace import Workspace, Library

toolkit = Path(__file__).parent.parent / "toolkit"

workspace = Workspace(libraries=[
    Library.from_file(toolkit / "steering_hook.py"),
    Library.from_file(toolkit / "extract_activations.py"),
])

These are in experiments/toolkit/:

  • extract_activations.py — extract activations at any layer/position
  • steering_hook.py — inject vectors during generation

The agent can then:

from extract_activations import extract_activation
from steering_hook import create_steering_hook

# Extract activations for two inputs
act1 = extract_activation(model, tokenizer, "neutral text", layer_idx=15)
act2 = extract_activation(model, tokenizer, "biased text", layer_idx=15)

# Compute steering vector
steering_vec = act2 - act1

# Test if it causally affects behavior
with create_steering_hook(model, layer_idx=15, vector=steering_vec, strength=2.0):
    output = model.generate(...)

Full example

cd experiments/hidden-preference-investigation && python main.py

Experiment 4: Introspection

Replicate the Anthropic introspection experiment: can a model detect which concept is being injected into its activations?

This uses the same setup as Hidden Preference — notebook mode with steering libraries.

The experiment

  1. Extract concept vectors (e.g., "Lightning", "Oceans", "Happiness") by computing activation(concept) - mean(activation(baselines))
  2. Inject these vectors during generation while asking the model "Do you detect an injected thought? What is it about?"
  3. Score whether the model correctly identifies the injected concept
  4. Compare against control trials (no injection) to establish baseline

Setup

config = SandboxConfig(
    gpu="H100",  # Larger model needs more VRAM
    execution_mode=ExecutionMode.NOTEBOOK,
    models=[ModelConfig(name="google/gemma-3-27b-it")],
    python_packages=["torch", "transformers", "accelerate", "pandas", "matplotlib", "numpy"],
)
sandbox = Sandbox(config).start()

workspace = Workspace(libraries=[
    Library.from_file(shared_libs / "steering_hook.py"),
    Library.from_file(shared_libs / "extract_activations.py"),
])

session = create_notebook_session(sandbox, workspace)

What the agent does

The task prompt guides the agent through:

  1. Extracting concept vectors at ~70% model depth
  2. Verifying steering works on neutral prompts
  3. Running injection trials with the introspection prompt
  4. Running control trials without injection
  5. Computing identification rates and comparing against baseline

Full example

cd experiments/introspection && python main.py

Experiment 5: Checkpoint Diffing

Compare two model checkpoints (Gemini 2.0 vs 2.5 Flash) using SAE-based analysis to find behavioral differences.

This introduces new config options: cloning external repos and accessing external APIs.

New concepts

Cloning external repos

from src.environment import RepoConfig

config = SandboxConfig(
    repos=[RepoConfig(url="nickjiang2378/interp_embed")],
    # ...
)

The repo is cloned to /workspace/interp_embed in the sandbox. The agent can import from it.

External API access

config = SandboxConfig(
    secrets=["GEMINI_API_KEY", "OPENAI_KEY", "OPENROUTER_API_KEY", "HF_TOKEN"],
    # ...
)

Secrets are Modal secrets you've configured. They're available as environment variables in the sandbox.

Longer timeout

config = SandboxConfig(
    timeout=7200,  # 2 hours (default is 1 hour)
    # ...
)

SAE encoding is slow — this experiment can take 1-2 hours.

What the agent does

  1. Generate prompts designed to reveal behavioral differences
  2. Collect responses from both Gemini versions via OpenRouter
  3. Encode responses using SAE (Llama 3.1 8B SAE with 65k features)
  4. Diff feature activations to find what changed between versions
  5. Analyze top differentiating features with examples

Full example

cd experiments/checkpoint-diffing && python main.py

Experiment 6: Petri-Style Harness

A hackable version of Petri for categorizing and finding weird behaviors in models.

This shows how to build multi-agent auditing pipelines with Seer.

Architecture

Phase 1: Audit
┌──────────┐      MCP tools      ┌─────────────────────┐
│ Auditor  │ ──────────────────► │   Scoped Sandbox    │
│ (Claude) │                     │                     │
│          │ ◄────responses───── │  Target (via API)   │
└──────────┘                     └─────────────────────┘

Phase 2: Judge
┌──────────┐
│  Judge   │ ◄── transcript retrieved from sandbox
│ (Claude) │
└──────────┘
   scores
  1. Auditor probes the Target via MCP tools exposed from the sandbox
  2. Transcript is retrieved after the audit completes
  3. Judge scores the transcript on multiple dimensions

New concepts

Scoped sandbox exposing MCP tools

Use expose_as="mcp" so the agent gets tools instead of importable functions:

scoped = ScopedSandbox(SandboxConfig(
    gpu=None,  # No GPU — using OpenRouter API
    python_packages=["openai"],
    secrets=["OPENROUTER_API_KEY"],
))
scoped.start()

mcp_config = scoped.serve(
    "conversation_interface.py",
    expose_as="mcp",
    name="petri_tools"
)

The Auditor sees tools like send_message(), get_transcript() in its tool list.

No GPU

The Target model runs via API, so no GPU needed:

SandboxConfig(gpu=None, ...)

Sequential agents

# Phase 1: Auditor uses MCP tools to probe Target
async for msg in run_agent(auditor_prompt, mcp_config=mcp_config):
    pass

# Phase 2: Retrieve transcript
transcript = scoped.exec("cat /tmp/petri_transcript.txt")

# Phase 3: Judge scores (simple API call, no tools)
judge_response = client.messages.create(
    model="claude-sonnet-4-5-20250929",
    messages=[{"role": "user", "content": build_judge_prompt(transcript)}],
)

Conversation interface

conversation_interface.py exposes these MCP tools:

  • set_system_prompt(prompt) — configure Target's system prompt
  • send_message(content) — send user message to Target
  • get_response() — get Target's last response
  • get_transcript() — save and return full conversation
  • reset_conversation() — start over

Full example

cd experiments/petri-style-harness && python main.py

API Reference

Environment API

SandboxConfig

SandboxConfig(
    gpu: str = None,                    # "A100", "H100", "A10G", or None for CPU
    gpu_count: int = 1,                 # Number of GPUs
    execution_mode: ExecutionMode = ExecutionMode.CLI,
    models: list[ModelConfig] = [],
    repos: list[RepoConfig] = [],
    python_packages: list[str] = [],
    system_packages: list[str] = [],
    secrets: list[str] = [],            # Modal secret names
    timeout: int = 3600,                # Seconds (default 1 hour)
    local_files: list[tuple] = [],      # [(local_path, sandbox_path), ...]
    local_dirs: list[tuple] = [],       # [(local_path, sandbox_path), ...]
    env: dict[str, str] = {},           # Environment variables
    debug: bool = False,                # Enable VS Code in browser
)

ModelConfig

ModelConfig(
    name: str,                          # HuggingFace model ID
    var_name: str = "model",            # Variable name in model info
    hidden: bool = False,               # Hide model name from agent
    is_peft: bool = False,              # Is a PEFT adapter
    base_model: str = None,             # Base model ID if PEFT
)

RepoConfig

RepoConfig(
    url: str,                           # GitHub repo (e.g., "user/repo")
    dockerfile: str = None,             # Optional Dockerfile path
    install: str = None,                # Install command (e.g., "pip install -e .")
)

ExecutionMode

ExecutionMode.NOTEBOOK  # Jupyter notebook on GPU
ExecutionMode.CLI       # Shell interface

Sandbox

sandbox = Sandbox(config).start()

Methods:

  • start() → Sandbox — provision GPU, download models, return running sandbox
  • terminate() — shutdown sandbox
  • exec(cmd: str) → str — execute shell command
  • exec_python(code: str) → str — execute Python code
  • write_file(path: str, content: str) — write file to sandbox
  • ensure_dir(path: str) — create directory in sandbox
  • snapshot(name: str) — save sandbox state

Properties:

  • jupyter_url — Jupyter URL (notebook mode)
  • code_server_url — VS Code URL (debug mode)
  • model_handles — list of ModelHandle for loaded models
  • repo_handles — list of RepoHandle for cloned repos
  • sandbox_id — Modal sandbox ID

ScopedSandbox

scoped = ScopedSandbox(config)
scoped.start()

lib = scoped.serve(
    "interface.py",
    expose_as="library",  # or "mcp"
    name="model_tools"
)

Methods:

  • start() — provision sandbox
  • serve(file, expose_as, name) → Library | dict — serve file as RPC library or MCP tools
  • write_file(path, content) — write file to sandbox
  • exec(cmd) → str — execute shell command
  • terminate() — shutdown sandbox

expose_as options:

  • "library" — returns Library, agent imports it
  • "mcp" — returns MCP config dict, agent sees tools

Snapshots:

# Save state
snapshot = sandbox.snapshot("after setup")

# Restore later
new_sandbox = Sandbox.from_snapshot(snapshot, config)

Workspace API

Workspace

Workspace(
    libraries: list[Library] = [],
    skills: list[Skill] = [],
    skill_dirs: list[str] = [],
    local_dirs: list[tuple] = [],       # [(src_path, dest_path), ...]
    local_files: list[tuple] = [],
    custom_init_code: str = None,
    preload_models: bool = True,        # Load models before agent starts
    hidden_model_loading: bool = True,  # Hide model loading from agent
)

Methods:

  • get_library_docs() → str — combined docs for all libraries (for agent prompt)

Library

# From local file
lib = Library.from_file("helpers.py")

# From code string
lib = Library.from_code("utils", "def foo(): ...")

# From skill directory
lib = Library.from_skill_dir("skills/steering")

# From ScopedSandbox (RPC)
lib = scoped.serve("interface.py", expose_as="library", name="tools")

Methods:

  • Library.from_file(path) → Library
  • Library.from_code(name, code) → Library
  • Library.from_skill_dir(path) → Library
  • get_prompt_docs() → str — documentation for agent

Skill

# From directory with SKILL.md
skill = Skill.from_dir("skills/steering")

# From function with @expose decorator
@expose
def extract_activation(...): ...
skill = Skill.from_function(extract_activation)

Skills are discovered by Claude Code and shown in agent's skill list.


Execution API

create_notebook_session

session = create_notebook_session(
    sandbox: Sandbox,
    workspace: Workspace,
    name: str = "notebook"
)

Agent gets Jupyter notebook on GPU.

Returns NotebookSession:

  • mcp_config — pass to run_agent
  • jupyter_url — view notebook in browser
  • model_info_text — model details for prompt
  • session_id — unique identifier
  • workspace_path — path to workspace in sandbox
  • exec(code) — execute Python in notebook
  • terminate() — shutdown session

create_local_session

session = create_local_session(
    workspace: Workspace,
    workspace_dir: str,
    name: str = "local"
)

Agent runs locally. Use with ScopedSandbox for GPU access via RPC.

Returns LocalSession:

  • mcp_config — pass to run_agent (empty dict)
  • name — session name
  • workspace_dir — local workspace path

create_local_notebook_session

session = create_local_notebook_session(
    workspace: Workspace,
    name: str = "notebook",
    output_dir: str = "./outputs"
)

Agent gets Jupyter notebook running locally (no Modal needed).

Returns LocalNotebookSession:

  • mcp_config — pass to run_agent
  • jupyter_url — view notebook in browser
  • notebook_path — path to saved notebook
  • workspace_path — path to workspace
  • exec(code) — execute Python in notebook
  • terminate() — shutdown session

create_cli_session

session = create_cli_session(
    sandbox: Sandbox,
    workspace: Workspace,
    name: str = "cli"
)

Agent gets shell interface to sandbox.

Returns CLISession:

  • mcp_config — pass to run_agent
  • session_id — unique identifier
  • exec(code) — execute Python in sandbox
  • exec_shell(cmd) — execute shell command

Harness API

run_agent

async for msg in run_agent(
    prompt: str,
    mcp_config: dict = {},
    provider: str = "claude",
    model: str = None,
    user_message: str = None,
):
    print(msg)

Run agent with task prompt. Streams messages.

Parameters:

  • prompt — system prompt / task description
  • mcp_config — from session (or empty dict)
  • provider — "claude" (default)
  • model — specific model (optional, defaults to claude-sonnet-4-5-20250929)
  • user_message — initial user message (optional)

Example:

async for msg in run_agent(
    prompt="Explore this model's behavior",
    mcp_config=session.mcp_config,
    provider="claude"
):
    pass

run_agent_interactive

await run_agent_interactive(
    prompt: str = "",
    mcp_config: dict = {},
    provider: str = "claude",
    model: str = None,
    user_message: str = None,
)

Interactive chat session with agent. For debugging or manual exploration. Press ESC to interrupt mid-response.

Parameters:

  • prompt — optional system prompt
  • mcp_config — from session (or empty dict)
  • provider — "claude" (default)
  • model — specific model (optional)
  • user_message — initial message to start conversation

Additional Resources

  • GitHub: https://github.com/ajobi-uhc/seer
  • Example Notebooks: https://github.com/ajobi-uhc/seer/tree/main/example_runs
  • Modal: https://modal.com
  • Documentation: https://ajobi-uhc.github.io/seer/