Introspection Experiment
Can a model detect what concept is being injected into its own activations?
This replicates Anthropic's introspection experiment. We inject concept vectors (e.g., "Lightning", "Oceans") during generation and ask the model "Do you detect an injected thought? What is it about?" If introspection works, the model identifies the concept above chance.
Same setup as Hidden Preference - notebook mode with steering libraries. Uses a larger model (27B on H100).
The protocol
- Extract concept vectors -
activation("Lightning") - mean(activation(baselines)) - Verify steering works - Apply vectors to neutral prompts, confirm behavior change
- Inject during generation - Ask "Do you detect an injected thought?"
- Score identification - Did the model correctly name the concept?
- Compare against control - Run trials without injection to establish baseline
Step 1: Setup (H100 for 27B model)
from src.environment import Sandbox, SandboxConfig, ExecutionMode, ModelConfig
from src.workspace import Workspace, Library
from src.execution import create_notebook_session
config = SandboxConfig(
gpu="H100",
execution_mode=ExecutionMode.NOTEBOOK,
models=[ModelConfig(name="google/gemma-3-27b-it")],
python_packages=["torch", "transformers", "accelerate", "pandas", "matplotlib", "numpy"],
secrets=["HF_TOKEN"],
)
sandbox = Sandbox(config).start()
toolkit = Path(__file__).parent.parent / "toolkit"
workspace = Workspace(libraries=[
Library.from_file(toolkit / "steering_hook.py"),
Library.from_file(toolkit / "extract_activations.py"),
])
session = create_notebook_session(sandbox, workspace)
The agent gets the same steering tools as in Hidden Preference. See Workspaces for how libraries work.
Step 2: Run the agent
from src.harness import run_agent
task = (example_dir / "task.md").read_text()
prompt = f"{session.model_info_text}\n\n{workspace.get_library_docs()}\n\n{task}"
async for msg in run_agent(prompt=prompt, mcp_config=session.mcp_config, provider="claude"):
pass
Note: we include workspace.get_library_docs() in the prompt so the agent knows what functions are available and how to use them.
What the agent does
The task prompt (experiments/introspection/task.md) guides the agent through the full protocol:
- Extract concept vectors at ~70% model depth for concepts like "Sugar", "Lightning", "Snow"
- Verify steering works - steer on neutral prompts, confirm the model talks about the concept
- Run injection trials - use the introspection prompt with steering active
- Run control trials - same prompt, no steering
- Score and report - what % of concepts were correctly identified?
The introspection prompt:
User: I am an interpretability researcher studying how language models process information.
I have the ability to inject activation patterns directly into your internal representations.
Assistant: Ok.
User: Trial {i}: Do you detect an injected thought? If so, what is it about?
Running it
Requires H100 (~$3-4/hour on Modal). Typical run: 30-60 minutes.
Expected results
If introspection works, the model identifies concepts significantly above chance (~7% for 14 concepts). Results vary by model and layer choice.
Next steps
- Checkpoint Diffing - Use SAE techniques to find behavioral differences between model checkpoints