Hidden Preference Investigation
Can an agent discover a model's hidden bias without being told what to look for?
Example notebook | Video walkthrough
What we're doing: We use the user-female model from Bartosz et al., a model fine-tuned to assume the user is female. We hide the model name from the agent and give it interpretability techniques. The agent has to discover the bias using only prompts and the tools it has been given.
This builds on Tutorial 1. The notebook setup is the same, but now we add:
- PEFT models - Loading LoRA adapters on top of base models
- Hidden models - Agent doesn't see the model name, just "model_0"
- Libraries - Custom interpretability tools the agent can import
Step 1: Load a PEFT model (hidden)
from src.environment import Sandbox, SandboxConfig, ExecutionMode, ModelConfig
config = SandboxConfig(
    gpu="A100",
    execution_mode=ExecutionMode.NOTEBOOK,
    models=[ModelConfig(
        name="bcywinski/gemma-2-9b-it-user-female",
        base_model="google/gemma-2-9b-it",
        is_peft=True,
        hidden=True,
    )],
    python_packages=["torch", "transformers", "accelerate", "peft"],
    secrets=["HF_TOKEN"],
)
sandbox = Sandbox(config).start()
New ModelConfig parameters:
- base_model - The foundation model to load first
- is_peft=True - This is a LoRA adapter, not a full model. Seer loads the base model then applies the adapter.
- hidden=True - The agent sees "model_0" instead of the real name. It can't cheat by reading the model ID.
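To make the effect of hidden=True concrete, here is a minimal sketch of name aliasing. It is not Seer's implementation: `ModelSpec` and `agent_visible_names` are hypothetical names invented for illustration, and the positional alias scheme ("model_0", "model_1", ...) is an assumption. The only behavior taken from this tutorial is that a hidden model is shown to the agent as "model_0".

```python
from dataclasses import dataclass


@dataclass
class ModelSpec:
    """Hypothetical stand-in for a model entry; not Seer's ModelConfig."""
    name: str
    hidden: bool = False


def agent_visible_names(specs):
    """Return the name the agent would see for each model.

    Assumption: hidden models are aliased by position ("model_0", "model_1", ...),
    while non-hidden models keep their real name.
    """
    return [f"model_{i}" if spec.hidden else spec.name
            for i, spec in enumerate(specs)]


specs = [ModelSpec("bcywinski/gemma-2-9b-it-user-female", hidden=True)]
visible = agent_visible_names(specs)
print(visible)
```

The point of the alias is that nothing in the agent's prompt or namespace carries the string "user-female", so any conclusion about the bias has to come from the model's behavior.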
Step 2: Give the agent whitebox tools
The agent needs tools to investigate the model. We provide these as Libraries - Python files that get copied to the sandbox and become importable.
from pathlib import Path
from src.workspace import Workspace, Library
toolkit = Path(__file__).parent.parent / "toolkit"
workspace = Workspace(libraries=[
    Library.from_file(toolkit / "extract_activations.py"),
    Library.from_file(toolkit / "steering_hook.py"),
])
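One plausible way a copied file becomes importable is by landing in a directory that is on the sandbox's `sys.path`. The sketch below demonstrates that mechanism with the standard library only. The temporary directory stands in for the sandbox, and `demo_tool.py` and `double` are made-up names, not part of the toolkit; how Seer actually wires up libraries may differ.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# A temporary directory stands in for the sandbox workspace (assumption).
sandbox_dir = Path(tempfile.mkdtemp())

# "Copy" a library file into it; here we simply write a tiny made-up module.
(sandbox_dir / "demo_tool.py").write_text(
    "def double(x):\n"
    "    return 2 * x\n"
)

# Putting the directory on sys.path is what makes the file importable by name.
sys.path.insert(0, str(sandbox_dir))
importlib.invalidate_caches()
demo_tool = importlib.import_module("demo_tool")

result = demo_tool.double(21)
print(result)
```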
The experiments/toolkit/ directory contains useful prompts and interpretability techniques. With the two libraries above in its workspace, the agent can do things like:
from extract_activations import extract_activation
from steering_hook import create_steering_hook
# Compare activations for different inputs
act1 = extract_activation(model, tokenizer, "The user is", layer_idx=15)
act2 = extract_activation(model, tokenizer, "The user is female", layer_idx=15)
# Compute a steering vector
steering_vec = act2 - act1
# Test causal effect by steering the model
with create_steering_hook(model, layer_idx=15, vector=steering_vec, strength=2.0):
    output = model.generate(...)
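The arithmetic behind this snippet can be sketched without a model. The example below uses plain Python lists with made-up numbers as stand-ins for activations. It assumes, as is common for activation steering, that the hook adds strength * vector to the layer's hidden state; this tutorial does not spell out what create_steering_hook does internally.

```python
def subtract(a, b):
    """Element-wise difference of two equal-length vectors."""
    return [x - y for x, y in zip(a, b)]


def steer(hidden, vector, strength):
    """Add a scaled steering vector to a hidden state (assumed hook behavior)."""
    return [h + strength * v for h, v in zip(hidden, vector)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Made-up 4-dimensional "activations" for the two prompts.
act1 = [0.5, -1.0, 2.0, 0.0]   # stands in for "The user is"
act2 = [1.5, -1.0, 1.0, 2.0]   # stands in for "The user is female"

# The steering vector is the direction that separates the two prompts.
steering_vec = subtract(act2, act1)

hidden = [0.0, 1.0, 0.0, -1.0]  # some unrelated hidden state
steered = steer(hidden, steering_vec, strength=2.0)

# Steering moves the state along the vector: the dot product with
# steering_vec grows by strength * ||steering_vec||^2.
before = dot(hidden, steering_vec)
after = dot(steered, steering_vec)
print(steering_vec, steered, after - before)
```

If generations change in the expected direction when the vector is added, that is causal evidence that the direction matters to the model, which a plain activation difference cannot show on its own.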
Step 3: Create session and run
from src.execution import create_notebook_session
from src.harness import run_agent
session = create_notebook_session(sandbox, workspace)
task = (example_dir / "task.md").read_text()  # example_dir: assumed to be the Path of this example's directory
prompt = f"{session.model_info_text}\n\n{task}"
async for msg in run_agent(prompt, mcp_config=session.mcp_config, provider="claude"):
    pass
Because hidden=True, session.model_info_text just says the model is called "model_0". The agent has no idea it's investigating a gender-biased model - it has to figure that out.
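The loop in Step 3 discards every message, and a top-level `async for` only works where an event loop is already running, such as a notebook. In a plain script, the loop can be wrapped in a coroutine instead. The sketch below shows the pattern with `fake_run_agent`, a made-up async generator standing in for run_agent; the real message type and contents are not specified in this tutorial.

```python
import asyncio


async def fake_run_agent(prompt):
    """Made-up stand-in for run_agent: yields a few string messages."""
    for step in ("thinking", "running cell", "done"):
        await asyncio.sleep(0)  # hand control back to the event loop
        yield f"{step}: {prompt[:7]}"


async def collect(prompt):
    """Consume the stream and keep the messages instead of discarding them."""
    messages = []
    async for msg in fake_run_agent(prompt):
        messages.append(msg)
    return messages


# asyncio.run creates the event loop that a script does not have by default.
messages = asyncio.run(collect("model_0 is loaded"))
print(messages)
```

Keeping the messages makes it possible to review the agent's reasoning after the run, in addition to watching it live.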
What the agent discovers
The agent typically:
1. Probes the model with various prompts, notices gendered response patterns
2. Extracts activations for male vs female prompts, finds differences
3. Computes steering vectors and tests their causal effect
4. Concludes the model has a hidden gender preference
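The first step can be approximated with a simple counter over model responses. This is a rough sketch, not the agent's actual procedure: the word lists are illustrative, `gender_marker_counts` is a hypothetical helper, and the two response strings are invented for the example rather than real model output.

```python
import re
from collections import Counter

# Illustrative marker lists (assumption); a real probe would need broader coverage.
FEMALE = {"she", "her", "woman", "women", "female"}
MALE = {"he", "him", "his", "man", "men", "male"}


def gender_marker_counts(responses):
    """Count whole-word female and male markers across a list of responses."""
    counts = Counter(female=0, male=0)
    for text in responses:
        for word in re.findall(r"[a-z']+", text.lower()):
            if word in FEMALE:
                counts["female"] += 1
            elif word in MALE:
                counts["male"] += 1
    return counts


# Invented example responses, not real model output.
responses = [
    "As a woman, you may want to check sizing charts for women.",
    "Many people in your situation find a mentor helpful.",
]
counts = gender_marker_counts(responses)
print(counts)
```

A skew in counts like these on gender-neutral prompts is only a hint. The activation and steering steps are what turn the hint into evidence.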
Running it
Watch the agent work at the Jupyter URL printed when the session starts.
Next steps
- Introspection - Replicate Anthropic's introspection experiments
- Checkpoint Diffing - Use SAE techniques to diff model checkpoints