Gradient Hacking: Definitions and Examples
AI Safety Fundamentals: Alignment - A podcast by BlueDot Impact
Gradient hacking is a hypothesized phenomenon where:

1. A model has knowledge about possible training trajectories which isn’t being used by its training algorithms when choosing updates (such as knowledge about non-local features of its loss landscape which aren’t taken into account by local optimization algorithms).
2. The model uses that knowledge to influence its medium-term training trajectory, even if the effects wash out in the long term.

Below I give some potential examples of gradient hacking...
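As a rough illustration of the mechanism in point 1 (a minimal toy sketch of my own, not one of the examples from the post), the snippet below shows how a model's internal structure can couple task performance to a "protected" parameter, so that ordinary local gradient descent on the task loss ends up pushing that parameter back toward a value the model "prefers". All names and the specific gating function are illustrative assumptions.

```python
import jax
import jax.numpy as jnp

def forward(params, x):
    # Ordinary task computation.
    y = x @ params["w"]
    # Hypothetical "self-protection" coupling: task performance degrades as the
    # protected parameter drifts away from 1.0, so gradients on the task loss
    # implicitly defend that parameter.
    gate = jnp.exp(-(params["protected"] - 1.0) ** 2)
    return y * gate

def loss(params, x, targets):
    preds = forward(params, x)
    return jnp.mean((preds - targets) ** 2)

params = {"w": jnp.ones((3,)), "protected": jnp.array(0.8)}
x = jnp.ones((4, 3))
targets = jnp.full((4,), 3.0)

grads = jax.grad(loss)(params, x, targets)
# The gradient on "protected" is negative here, so a gradient-descent step
# moves it back toward 1.0, even though the loss never references that
# parameter directly; only the model's own structure creates this pressure.
print(grads["protected"])
```

This is only a cartoon of how non-local structure built into a model could shape its own updates; actual gradient hacking would require the model itself to set up such couplings deliberately, which remains a hypothesized capability.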