A Weekend on Neural Network Interpretability

This is a summary of my experience at a weekend workshop on interpretability, held from the morning of 2023-05-27 to the evening of 2023-05-28 and organized in Paris by EffiSciences.

Saturday Morning

We got an introduction to AI risk, and interpretability was presented as a possible way to counter Goal Misgeneralization, whereby an AI ends up pursuing a goal different from the one that was intended because it found a proxy for the “right solution”. If that proxy comes apart from the intended goal in deployment (while the two coincided in training), the agent will keep following the proxy even when this yields bad outcomes.

Interpretability means figuring out what is going on inside a neural network: what information it retains as relevant to making a decision. For example, in Feature visualization the authors present a method to generate an image that highly excites a particular neuron; this supposedly tells us which human-intelligible concept that neuron “recognizes”. This is reminiscent of the well-critiqued “grandmother neuron” theory of how concepts are represented in the human brain.
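
A minimal sketch of the underlying idea, often called activation maximization: gradient ascent on the input image so that a chosen channel fires strongly. The model, layer index and channel below are arbitrary choices for illustration, not the paper’s setup.

```python
# Activation-maximization sketch: optimize an input image to excite one
# channel ("neuron") of a chosen convolutional layer.
import torch
import torchvision.models as models

model = models.vgg16(weights="IMAGENET1K_V1").eval()
layer, channel = model.features[10], 42          # arbitrary conv layer and channel

acts = {}
layer.register_forward_hook(lambda m, i, o: acts.update(out=o))

img = torch.rand(1, 3, 224, 224, requires_grad=True)
opt = torch.optim.Adam([img], lr=0.05)

for _ in range(200):
    opt.zero_grad()
    model(img)
    (-acts["out"][0, channel].mean()).backward()  # ascend on the channel's mean activation
    opt.step()
```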

Then we went into a particular interpretability technique named GradCam. This method allows you to see which pixels are relevant to a particular input-output (e.g. image-class) pair.

In a workshop session, we filled out a notebook implementing GradCam. In particular, I was introduced to Einstein summation notation, implemented in the einops Python package, which is an elegant way to express complex linear tensor operations in few symbols.
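
For reference, here is a rough sketch of the core GradCam computation (not the workshop notebook itself), assuming a torchvision ResNet-18 and using einops for the reductions:

```python
# GradCam sketch: weight each feature map of the last conv block by the
# spatially averaged gradient of the class score, then sum and apply ReLU.
import torch
import torchvision.models as models
from einops import reduce, einsum

model = models.resnet18(weights="IMAGENET1K_V1").eval()
target_layer = model.layer4                      # last conv block of ResNet-18

feats, grads = {}, {}
target_layer.register_forward_hook(lambda m, i, o: feats.update(a=o))
target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(a=go[0]))

img = torch.rand(1, 3, 224, 224)                 # stand-in for a preprocessed image
logits = model(img)
logits[0, logits.argmax()].backward()            # gradient of the predicted class score

weights = reduce(grads["a"], "b c h w -> b c", "mean")             # channel importances
cam = einsum(weights, feats["a"], "b c, b c h w -> b h w").relu()  # coarse pixel-relevance map
```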

According to Adebayo et al., GradCam is more reliable than other feature importance methods for images; the other methods seem closer to edge detectors.


Saturday Afternoon

We got a quick introduction to Transformers as a core part of the architecture of many current state-of-the-art models. As my former internship supervisor put it, AI practitioners tend to use them a lot, sometimes excessively:

Flex-tape meme: AI engineers slap Transformers onto literally every task they come across.

We covered the attention block architecture and discussed the concepts of the residual stream and positional embeddings. By that point in the day my brain was fried, so the terminology of “queries”, “keys” and “values” used to describe the attention mechanism felt a bit esoteric.
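
In hindsight, a bare-bones single-head self-attention sketch helps pin down what those three words refer to (dimensions here are arbitrary):

```python
# Single-head self-attention, stripped to the core computation.
import torch

seq_len, d_model, d_head = 10, 64, 16
x = torch.rand(seq_len, d_model)        # residual-stream activations, one row per token

W_Q = torch.rand(d_model, d_head)       # projects each token to a "query"
W_K = torch.rand(d_model, d_head)       # ... to a "key"
W_V = torch.rand(d_model, d_head)       # ... to a "value"

Q, K, V = x @ W_Q, x @ W_K, x @ W_V

scores = Q @ K.T / d_head ** 0.5        # how well each query matches each key
pattern = scores.softmax(dim=-1)        # attention weights for each destination token
out = pattern @ V                       # weighted mix of value vectors
```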


Sunday Morning

We went over Transformers once more, then the LogitLens method was explained to us. It’s much simpler to explain than GradCam, but I’m less sure its results should really be believed. We spent the morning implementing it (by adding hooks onto a Hugging Face pre-trained GPT-2 model).
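
The gist, as a sketch (using output_hidden_states rather than hooks for brevity, so not exactly what we wrote): push the residual stream after each block through the model’s final layer norm and unembedding, and see which token would be predicted at that depth.

```python
# LogitLens sketch on Hugging Face GPT-2: decode intermediate residual-stream
# states through the final layer norm and the unembedding matrix.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

ids = tok("The Eiffel Tower is located in the city of", return_tensors="pt").input_ids
with torch.no_grad():
    out = model(ids, output_hidden_states=True)

# hidden_states[0] is the embedding output, hidden_states[i] the residual stream
# after block i; the last element already has the final layer norm applied.
for i, h in enumerate(out.hidden_states[:-1]):
    logits = model.lm_head(model.transformer.ln_f(h))  # the "lens" at depth i
    top = int(logits[0, -1].argmax())                  # top next-token guess at that depth
    print(i, repr(tok.decode(top)))
```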


Sunday Afternoon

We went over many (I’d say too many) papers on interpreting Transformers. Once again my brain was fried pretty quickly, and my knowledge of Transformers was too fresh to really grasp what was going on. However, I do understand that attention blocks seem interpretable and that there are some connections with linear algebra (e.g. the query-key product is invariant under a simultaneous rotation of the two matrices, and something about heads working only in small subspaces of some big space).
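
The rotation point, at least, is easy to check numerically: multiplying W_Q and W_K by the same orthogonal matrix R leaves the product W_Q W_K^T unchanged, because R R^T is the identity.

```python
# Check that the query-key product is invariant under a shared rotation of W_Q and W_K.
import torch

d_model, d_head = 64, 16
W_Q, W_K = torch.rand(d_model, d_head), torch.rand(d_model, d_head)
R, _ = torch.linalg.qr(torch.rand(d_head, d_head))   # a random orthogonal matrix

original = W_Q @ W_K.T
rotated = (W_Q @ R) @ (W_K @ R).T                    # = W_Q @ (R @ R.T) @ W_K.T

print(torch.allclose(original, rotated, atol=1e-5))  # True
```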

This illustration is considered not bad, although you still have to ponder it for a while.

We discussed the concern that interpretability itself is likely not a panacea but rather a nice, “cool” subject that can help to introduce ML practitioners to AI safety research.


Takeaway

Further Readings

SERI MATS streams

  • Aligning Language Models
  • Agent Foundations
  • Consequentialist Cognition and Deep Constraints
  • Cyborgism
  • Deceptive AI
  • Evaluating Dangerous Capabilities
  • Interdisciplinary AI Safety
  • Mechanistic Interpretability
  • Multipolar AI Safety
  • Powerseeking in Language Models
  • Shard Theory
  • Understanding AI Hacking

Questions

  • Is AI really that dangerous? As a student who has taken courses on cryptography and privacy, I always find it surprisingly difficult to properly argue for the importance of privacy and safety with peers and practitioners who might simply be oblivious to these issues. Perhaps the AGI Safety Fundamentals course could give me some arguments there.
  • What is “grokking”?

Final Thoughts

This was my first “formal” introduction to AI safety, apart from hearing people talk about it around me and watching Rob Miles videos. Although the future of humanity is supposedly at stake, my goal was just to get into some technical details. Along the way, I got to actually talk with very smart people who have been thinking about these issues for much longer than I have.

This is post number 003 of #100daystooffload.