Vision Hopfield Memory Networks: Bridging Associative Memory and Predictive Coding for Data-Efficient Vision
Abstract

The Vision Hopfield Memory Network (V-HMN) is a brain-inspired vision backbone that replaces self-attention and traditional token-mixing with hierarchical associative memory modules. It achieves state-of-the-art performance on CIFAR and SVHN benchmarks while demonstrating superior data efficiency and interpretability compared to Transformers and Mamba-based models.

Executive Summary

TL;DR: V-HMN is a novel vision backbone that abandons the standard self-attention paradigm in favor of associative memory retrieval. By utilizing Local and Global Hopfield modules and a predictive-coding-inspired refinement loop, it achieves better performance with significantly less data and provides a transparent look into the "prototypes" the model uses for inference.

Background: While Transformers (ViT) and State-Space Models (Mamba) dominate the SOTA leaderboards, they remain computationally expensive and biologically implausible. V-HMN positions itself as a brain-inspired alternative, focusing on how the human brain uses memory to "fill in the blanks" and correct perception through feedback loops.

Problem & Motivation: The Data-Hunger of Modern Vision

Modern deep learning models are essentially "feedforward machines" that require millions of samples to generalize. They lack:

  1. Biological Plausibility: They don't use recurrent error-correction like the human visual cortex.
  2. Data Efficiency: Without strong inductive biases, they struggle in low-data regimes.
  3. Interpretability: It is difficult to see which "stored concepts" a ViT is referencing during a specific classification.

The authors' insight is to treat vision not just as a sequence of transformations, but as an iterative retrieval process.

Methodology: Memory as the Core Primitive

V-HMN replaces the "mixer" in a standard block with two memory paths:

  • Local Window Memory: Operates on $k \times k$ neighborhoods to denoise and complete local textures/edges.
  • Global Template Memory: Uses a scene-level query to retrieve a semantic "prior" (e.g., "this is a ship in the ocean") to modulate all tokens.
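Both paths can be understood as one-step retrieval from a modern (continuous) Hopfield network, which is a softmax-weighted recall over a bank of stored patterns. The sketch below is illustrative only, not the paper's implementation; the function name, the inverse-temperature `beta`, and the toy shapes are all assumptions:

```python
import numpy as np

def hopfield_retrieve(queries, memory, beta=1.0):
    """One-step modern Hopfield retrieval: each query recalls a
    softmax-weighted blend of stored patterns.
    memory has shape (num_slots, dim); queries has shape (n, dim)."""
    scores = beta * queries @ memory.T            # (n, num_slots) similarities
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ memory                       # retrieved prototypes (n, dim)

# Local path: queries are the tokens of a k x k window.
# Global path: a single pooled scene-level query against the template memory.
tokens = np.random.randn(9, 64)        # toy 3x3 window of 64-d tokens
local_mem = np.random.randn(128, 64)   # 128 stored local prototypes
retrieved = hopfield_retrieve(tokens, local_mem, beta=4.0)
print(retrieved.shape)  # (9, 64)
```

A higher `beta` sharpens the softmax, so retrieval snaps toward a single stored prototype rather than a blend.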

The Iterative Refinement Rule

Instead of a single pass, V-HMN uses a Predictive Coding (PC) update: $$z^{(t+1)} = z^{(t)} + \beta (m - z^{(t)})$$ Here, $m$ is the retrieved prototype (the prediction) and $(m - z^{(t)})$ is the prediction error. Each step nudges the current representation closer to the model's memory of what it should look like.
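The refinement rule above can be sketched as a short loop that alternates retrieval and error-correction. This is a minimal toy, assuming the one-step softmax retrieval described earlier; the step size `beta_step`, sharpness `beta_hop`, and the identity-matrix memory are illustrative choices, not the paper's settings:

```python
import numpy as np

def pc_refine(z, memory, beta_step=0.5, beta_hop=8.0, t_steps=2):
    """Iterative predictive-coding refinement: retrieve a prototype m,
    then move z a fraction beta_step along the prediction error (m - z)."""
    for _ in range(t_steps):
        scores = beta_hop * z @ memory.T
        scores -= scores.max(axis=-1, keepdims=True)
        w = np.exp(scores)
        w /= w.sum(axis=-1, keepdims=True)
        m = w @ memory                  # prediction: retrieved prototype
        z = z + beta_step * (m - z)     # z^{(t+1)} = z^{(t)} + beta (m - z^{(t)})
    return z

# Toy demo: a noisy version of a stored pattern is pulled back toward it.
rng = np.random.default_rng(0)
memory = np.eye(8)                       # 8 stored one-hot prototypes
z0 = memory[0] + 0.3 * rng.standard_normal(8)
z_refined = pc_refine(z0, memory)
print(np.linalg.norm(z0 - memory[0]) > np.linalg.norm(z_refined - memory[0]))  # True
```

With `t_steps=0` the loop is a no-op, which corresponds to the purely feedforward baseline in the ablation below.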

Figure 1: The V-HMN architecture, highlighting the Local and Global memory pathways.

Experiments & Results: Slaying the Baseline

V-HMN was tested against heavyweights like ViT, Swin, MLP-Mixer, and Vision Mamba (Vim).

1. Robustness & Iteration

The ablation study on $t$ (the number of refinement steps) showed that doing "mental work" helps. Moving from $t=0$ (purely feedforward) to $t=2$ significantly boosted accuracy, especially under noise and occlusion.

Figure 2: Performance gains across various corruption types as refinement iterations increase.

2. Extreme Data Efficiency

In the 10% data regime, V-HMN outperformed every baseline by a wide margin. While ViT dropped to 72.73% on CIFAR-10, V-HMN held strong at 80.22%. This suggests the stored prototypes act as a powerful anchor when labels are scarce.

3. Seeing the Memory

One of the coolest features of V-HMN is visualization: we can directly inspect the prototypes the model retrieves. When the model classifies a "Deer," we can see it matching patch-level textures against stored deer-leg and deer-head prototypes.

Figure 3: Semantic alignment between input patches and retrieved memory prototypes.

Critical Analysis & Conclusion

Takeaways

V-HMN shows that memory-centric architectures are viable. By making stored patterns explicit (in a ring buffer/memory bank) rather than hiding them in dense weights, we gain both data efficiency and interpretability.
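The paper describes the memory bank only at a high level; a minimal FIFO ring-buffer sketch of the idea looks like the following. The class name, slot count, and write policy are hypothetical illustrations, not the authors' code:

```python
import numpy as np

class RingMemoryBank:
    """Fixed-capacity memory bank: new prototypes overwrite the oldest,
    so storage stays bounded and every stored pattern stays inspectable."""
    def __init__(self, num_slots, dim):
        self.bank = np.zeros((num_slots, dim))
        self.ptr = 0                       # next slot to overwrite
        self.num_slots = num_slots

    def write(self, patterns):
        for p in np.atleast_2d(patterns):
            self.bank[self.ptr] = p
            self.ptr = (self.ptr + 1) % self.num_slots

bank = RingMemoryBank(num_slots=4, dim=2)
bank.write(np.arange(10.0).reshape(5, 2))  # 5 writes into 4 slots wrap around
print(bank.bank[0])  # [8. 9.] -- the oldest slot was overwritten by the 5th pattern
```

Because the bank is an ordinary array rather than dense weights, visualizing "what the model remembers" reduces to plotting its rows.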

Limitations

  • Memory Overhead: Managing large memory banks (e.g., 5000 slots for ImageNet) adds a different kind of complexity compared to weights.
  • Scaling: While competitive on ImageNet-1k, the "frozen memory" at inference might need more dynamic strategies for truly open-world, billion-scale datasets.

Future Outlook

This work opens the door for multimodal backbones where a single memory bank could store "concept prototypes" shared across text, audio, and vision, mimicking the associative nature of the human hippocampus.
