Continual Learning
Learning sequentially without forgetting — making models that accumulate knowledge over time the way humans do.
Landscape
The core challenge: neural networks trained sequentially on non-stationary data distributions catastrophically forget previous tasks. The goal is to retain old knowledge while acquiring new knowledge, with little or no access to old data.
Sub-areas
- Replay-based methods — maintain a buffer of old examples or generate synthetic ones (ER, DER, GDumb)
- Regularization-based methods — constrain weight updates to protect important parameters (EWC, SI, MAS)
- Architecture-based methods — grow or partition the network (PNNs, PackNet, DyTox)
- Meta-learning approaches — learn to learn quickly without forgetting (OML, La-MAML)
- Prompt-based methods — condition frozen backbones with task-specific prompts (L2P, DualPrompt)
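Replay-based methods like ER reduce, at their core, to a fixed-size buffer mixed into each training batch. A minimal, framework-free sketch using reservoir sampling, so the buffer remains a uniform sample of everything seen so far (class and method names are illustrative, not from any particular library):

```python
import random

class ReservoirBuffer:
    """Fixed-size replay buffer via reservoir sampling: every example
    seen so far has equal probability of being retained."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.data = []
        self.n_seen = 0
        self.rng = random.Random(seed)

    def add(self, example):
        self.n_seen += 1
        if len(self.data) < self.capacity:
            self.data.append(example)
        else:
            # Overwrite a random slot with probability capacity / n_seen.
            j = self.rng.randrange(self.n_seen)
            if j < self.capacity:
                self.data[j] = example

    def sample(self, k):
        """Draw a replay batch to interleave with the current task's batch."""
        k = min(k, len(self.data))
        return self.rng.sample(self.data, k)
```

In an ER-style training loop, each gradient step would use the current batch concatenated with `buffer.sample(k)`; DER additionally stores and replays past logits, which this sketch omits.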
Landmark papers
- Catastrophic Interference in Connectionist Networks: The Sequential Learning Problem — McCloskey & Cohen, 1989. Where the problem was named.
- Overcoming Catastrophic Forgetting in Neural Networks (EWC) — Kirkpatrick et al., 2017. The canonical regularization paper.
- Progressive Neural Networks — Rusu et al., 2016. Architecture growth approach.
- Dark Experience for General Continual Learning (DER) — Buzzega et al., 2020.
- Learning to Prompt for Continual Learning (L2P) — Wang et al., 2022. Prompt-based with frozen backbone.
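EWC's mechanism is simple enough to sketch in a few lines: estimate each parameter's importance as the diagonal of the empirical Fisher (mean squared per-example log-likelihood gradient on old-task data), then penalize movement away from the old-task solution, weighted by that importance. A toy, framework-free sketch (function names are mine, not from the paper's code):

```python
def diagonal_fisher(grad_samples):
    """Empirical diagonal Fisher: mean of squared per-example
    log-likelihood gradients, computed on old-task data.
    grad_samples: list of per-example gradient vectors (lists of floats)."""
    n = len(grad_samples)
    dim = len(grad_samples[0])
    return [sum(g[i] ** 2 for g in grad_samples) / n for i in range(dim)]

def ewc_penalty(params, anchor_params, fisher_diag, lam):
    """Quadratic EWC penalty: (lam / 2) * sum_i F_i * (theta_i - theta*_i)^2,
    where theta* is the solution found on the old task. Added to the
    new task's loss; parameters the Fisher deems important move less."""
    return 0.5 * lam * sum(
        f * (p - a) ** 2
        for p, a, f in zip(params, anchor_params, fisher_diag)
    )
```

Setting `fisher_diag` to all ones recovers a plain L2-to-anchor baseline, which is exactly the ablation the question under "Questions & Ideas" about Fisher information vs. the regularization coefficient would run.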
Key figures
Razvan Pascanu, Marc'Aurelio Ranzato, German Ros, Surya Ganguli (theory), David Lopez-Paz.
Open Problems
- Is catastrophic forgetting a fundamental limitation or a training artifact? EWC-style approaches assume important parameters can be identified post-hoc. But importance is task-relative. Is there a training objective that avoids this by construction?
- What's the right evaluation protocol? Average accuracy hides the forgetting-plasticity tradeoff. No standard benchmark captures the full space of continual learning scenarios.
- Can continual learning and few-shot learning be unified? Both involve rapid adaptation from limited data. The separation between them seems artificial.
- Does replay actually work at scale? Replay-based methods are empirically strong but require storing examples — violating privacy constraints in many real settings. Generative replay is weaker than hoped. The tension is unresolved.
- How does the representation evolve during continual learning? Most work treats the representation as fixed (frozen backbone) or chaotic (full fine-tuning). Understanding how the representation should evolve is underexplored.
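On the evaluation point: the metrics that expose the forgetting-plasticity tradeoff are all computable from one accuracy matrix. A sketch, assuming the convention (as in the GEM paper) that `R[i][j]` is accuracy on task j after training through task i; exact definitions vary across papers:

```python
def cl_metrics(R):
    """Continual-learning metrics from an accuracy matrix.
    R[i][j]: accuracy on task j after training on tasks 0..i (T >= 2 tasks).
    Returns (average accuracy, backward transfer, mean forgetting)."""
    T = len(R)
    final = R[T - 1]  # accuracies after the whole sequence
    avg_acc = sum(final) / T
    # BWT: how training on later tasks changed earlier-task accuracy
    # (negative BWT = net forgetting).
    bwt = sum(final[j] - R[j][j] for j in range(T - 1)) / (T - 1)
    # Forgetting: drop from each task's best accuracy before the final step.
    forgetting = sum(
        max(R[i][j] for i in range(j, T - 1)) - final[j]
        for j in range(T - 1)
    ) / (T - 1)
    return avg_acc, bwt, forgetting
```

Reporting all three (plus accuracy-over-time curves) is what average accuracy alone hides.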
Questions & Ideas
- Is catastrophic forgetting caused more by representation drift or by weight interference? Can you isolate the two experimentally?
- What if you framed continual learning as a compression problem — retain only what can't be reconstructed from the new distribution?
- Can a model explicitly learn what to forget as a trained skill, rather than treating forgetting as a failure mode?
- Does the geometry of the loss landscape change as tasks accumulate? Is there a topological signature of forgetting?
- How much of EWC's success is due to Fisher information, and how much is just the regularization coefficient doing work?
- Can replay be made privacy-preserving without resorting to generative models? What's the minimal sufficient statistic of a task?
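On the last question: one concrete candidate for a compact task statistic is per-class feature means (prototypes), classified by nearest mean, in the spirit of nearest-class-mean classifiers. No raw examples are retained, only class averages in feature space. A toy sketch (names illustrative; whether means alone are "sufficient" is exactly the open question):

```python
def class_prototypes(features, labels):
    """Compress a task to per-class feature means -- a candidate
    minimal statistic that avoids storing raw examples."""
    sums, counts = {}, {}
    for f, y in zip(features, labels):
        if y not in sums:
            sums[y] = [0.0] * len(f)
            counts[y] = 0
        sums[y] = [s + x for s, x in zip(sums[y], f)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}

def nearest_prototype(feature, prototypes):
    """Classify by nearest class mean (squared Euclidean distance)."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(prototypes, key=lambda y: dist2(feature, prototypes[y]))
```

The privacy question then reduces to whether prototypes (or prototypes plus per-class covariances) leak less than a raw replay buffer while supporting comparable accuracy.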
My Take
This section evolves as thinking develops.
The framing of continual learning as "don't forget" might be the wrong frame. Humans don't remember everything either — selective forgetting is functional. The real goal might be: retain the right things. That shifts the question from "how do we prevent forgetting" to "what should be retained and how do we know?"
The connection to world models is direct: a world model that updates based on new experience but retains its general structure is exactly a continual learner. The RL + continual learning intersection seems underdeveloped.
Journal
2026-02-27 — Setting up this area. My current background: read EWC, DER, and L2P. Have a rough map of the landscape. The open question I keep returning to: is the problem fundamentally about representations or about optimization? The regularization vs. replay debate maps onto this — EWC says "protect weights," replay says "rehearse data." Neither says "learn better representations."
Next: survey the architecture-based methods (PackNet, DyTox) and then look at the theory papers from Ganguli's group.