Skip to content

Checkpoint Diffing

What changed between Gemini 2.0 and 2.5 Flash? Use SAE features to find out.

Example notebook

This uses data-centric SAE techniques from Jiang et al. to diff model checkpoints. The idea: generate responses from both model versions, encode them with an SAE, diff the feature activations. Features that activate differently reveal behavioral changes between checkpoints.

New concepts

This experiment introduces repo cloning, external API access, and longer timeouts.

Clone external repos

from src.environment import RepoConfig

config = SandboxConfig(
    repos=[RepoConfig(url="nickjiang2378/interp_embed")],
    ...
)

Cloned to /workspace/interp_embed. The agent can import from it directly.

External API access

config = SandboxConfig(
    secrets=["GEMINI_API_KEY", "OPENROUTER_API_KEY", "HF_TOKEN"],
    ...
)

Secrets are Modal secrets. Available as environment variables in the sandbox.

Longer timeout

config = SandboxConfig(
    timeout=7200,  # 2 hours
    ...
)

SAE encoding is slow. Default 1 hour isn't enough for this experiment.

Setup

from src.environment import Sandbox, SandboxConfig, ExecutionMode, RepoConfig
from src.workspace import Workspace, Library
from src.execution import create_notebook_session

config = SandboxConfig(
    gpu="A100",
    execution_mode=ExecutionMode.NOTEBOOK,
    repos=[RepoConfig(url="nickjiang2378/interp_embed")],
    system_packages=["git"],
    python_packages=[
        "torch", "transformers", "accelerate", "pandas", "numpy", "scipy",
        "google-generativeai", "datasets", "matplotlib", "seaborn",
        "sae-lens", "transformer-lens", "huggingface-hub", "openai",
    ],
    secrets=["GEMINI_API_KEY", "OPENROUTER_API_KEY", "HF_TOKEN"],
    timeout=7200,
)
sandbox = Sandbox(config).start()

workspace = Workspace(libraries=[
    Library.from_file(example_dir / "openrouter_client.py")
])

session = create_notebook_session(sandbox, workspace)

The openrouter_client.py library provides a simple interface to call both Gemini versions via OpenRouter.

What the agent does

The task prompt (experiments/checkpoint-diffing/task.md) guides the agent through:

  1. Generate prompts designed to reveal behavioral differences
  2. Collect responses from both Gemini versions via OpenRouter
  3. Encode with SAE (Llama 3.1 8B SAE, 65k features)
  4. Diff feature activations between versions
  5. Analyze top differentiating features - what changed?

Running it

cd experiments/checkpoint-diffing
python main.py

Takes 1-2 hours. Requires A100 for SAE encoding.

Next steps

  • Petri Harness - Hackable Petri for auditing model behaviors with blackbox or whitebox access