
Vision-Language Asymmetry in Bistable Image Captioning
Sparse autoencoders show that LLaVA-1.6’s vision encoder represents both aspects of bistable images simultaneously while the language decoder commits to one. Causal steering localizes the seeing-as bottleneck to the language model.
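
As a rough illustration of the two tools named above, the sketch below encodes toy activations with a ReLU sparse autoencoder and then steers a hidden state along one feature's decoder direction. Everything in it is an assumption made for illustration: the 4-dimensional activations, the two features standing in for the two readings of a bistable image, the tied encoder/decoder weights, and the function names `sae_encode` and `steer`. It is not the model, the trained autoencoder, or the steering procedure used in this work.

```python
import numpy as np

def sae_encode(x, W_enc, b_enc):
    """ReLU sparse-autoencoder encoder: feature activations for vector x."""
    return np.maximum(0.0, x @ W_enc + b_enc)

def steer(h, direction, alpha):
    """Add alpha times the unit-normalized feature direction to hidden state h."""
    unit = direction / np.linalg.norm(direction)
    return h + alpha * unit

# Hypothetical toy setup: 2 SAE features over 4-dim activations, standing in
# for the two readings of a bistable image (e.g. "duck" and "rabbit").
W_enc = np.array([[1.0, 0.0],
                  [0.0, 1.0],
                  [0.0, 0.0],
                  [0.0, 0.0]])
b_enc = np.zeros(2)
W_dec = W_enc.T  # rows are decoder directions; tied weights for simplicity

vision_act = np.array([0.8, 0.7, 0.1, 0.0])    # both readings active
language_act = np.array([0.9, 0.0, 0.2, 0.0])  # committed to reading 0

vision_feats = sae_encode(vision_act, W_enc, b_enc)
language_feats = sae_encode(language_act, W_enc, b_enc)

# Causal intervention: push the language-side state toward reading 1.
steered = steer(language_act, W_dec[1], alpha=2.0)
steered_feats = sae_encode(steered, W_enc, b_enc)

print("vision features:  ", vision_feats)
print("language features:", language_feats)
print("steered features: ", steered_feats)
```

In this toy, the vision-side activation carries both features while the language-side activation carries only one, and steering along the second decoder direction switches the suppressed feature on. With a real model, the same addition would typically be applied to a chosen layer's hidden states during the forward pass.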
