Checkpoint Diffing

What changed between Gemini 2.0 and 2.5 Flash? Use SAE features to find out.

This uses data-centric SAE techniques from Jiang et al. to diff model checkpoints. The idea: generate responses from both model versions, encode them with an SAE, diff the feature activations. Features that activate differently reveal behavioral changes between checkpoints.

New concepts

This experiment introduces repo cloning, external API access, and longer timeouts.

Clone external repos

from src.environment import RepoConfig

config = SandboxConfig(
    repos=[RepoConfig(url="nickjiang2378/interp_embed")],
    ...
)

Cloned to /workspace/interp_embed. The agent can import from it directly.

External API access

config = SandboxConfig(
    secrets=["GEMINI_API_KEY", "OPENROUTER_API_KEY", "HF_TOKEN"],
    ...
)

Secrets are Modal secrets. Available as environment variables in the sandbox.

Longer timeout

config = SandboxConfig(
    timeout=7200,  # 2 hours
    ...
)

SAE encoding is slow. Default 1 hour isn't enough for this experiment.

Setup

from src.environment import Sandbox, SandboxConfig, ExecutionMode, RepoConfig
from src.workspace import Workspace, Library
from src.execution import create_notebook_session

config = SandboxConfig(
    gpu="A100",
    execution_mode=ExecutionMode.NOTEBOOK,
    repos=[RepoConfig(url="nickjiang2378/interp_embed")],
    system_packages=["git"],
    python_packages=[
        "torch", "transformers", "accelerate", "pandas", "numpy", "scipy",
        "google-generativeai", "datasets", "matplotlib", "seaborn",
        "sae-lens", "transformer-lens", "huggingface-hub", "openai",
    ],
    secrets=["GEMINI_API_KEY", "OPENROUTER_API_KEY", "HF_TOKEN"],
    timeout=7200,
)
sandbox = Sandbox(config).start()

workspace = Workspace(libraries=[
    Library.from_file(example_dir / "openrouter_client.py")
])

session = create_notebook_session(sandbox, workspace)

The openrouter_client.py library provides a simple interface to call both Gemini versions via OpenRouter.

What the agent does

The task prompt (experiments/checkpoint-diffing/task.md) guides the agent through:

Generate prompts designed to reveal behavioral differences
Collect responses from both Gemini versions via OpenRouter
Encode with SAE (Llama 3.1 8B SAE, 65k features)
Diff feature activations between versions
Analyze top differentiating features - what changed?

Running it

cd experiments/checkpoint-diffing
python main.py

Takes 1-2 hours. Requires A100 for SAE encoding.

Next steps

Petri Harness - Hackable Petri for auditing model behaviors with blackbox or whitebox access