LLMs are not the Black Box you were promised
Jay Hack · Mar 2026

An adaptation of a thread on Anthropic's "On the Biology of a Large Language Model." It walks through how mechanistic interpretability and circuit tracing let us decompose a model's activations into human-interpretable features, watch genuine multi-step reasoning unfold (Dallas → Texas → Austin), and uncover quirks like Claude's non-human integer-addition algorithm, arguing that LLMs are far less of a black box than we were promised.




