Research Resources
Curated papers, videos, and tutorials on Vision Transformers and FPGA acceleration.
The original Vision Transformer (ViT) paper. Demonstrates that pure transformer architectures can match or exceed CNNs on image classification when trained on large datasets.
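The core move in the ViT paper is treating an image as a sequence of flattened patches fed to a standard transformer. A minimal numpy sketch of the patch-embedding step, assuming ViT-Base defaults (224x224 input, 16x16 patches, 768-d embeddings) and a random projection matrix standing in for the learned one:

```python
import numpy as np

# Assumed sizes: 224x224 RGB image, 16x16 patches, 768-d embeddings (ViT-Base defaults).
H = W = 224; P = 16; C = 3; D = 768
num_patches = (H // P) * (W // P)  # 196 patches

rng = np.random.default_rng(0)
image = rng.standard_normal((H, W, C))

# Split the image into non-overlapping P x P patches and flatten each one.
patches = image.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
patches = patches.reshape(num_patches, P * P * C)  # (196, 768)

# Linear projection to the embedding dimension (the learned part; random here).
W_embed = rng.standard_normal((P * P * C, D)) * 0.02
tokens = patches @ W_embed  # one token per patch

print(tokens.shape)  # (196, 768)
```

From here the model prepends a class token, adds position embeddings, and runs a plain transformer encoder.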
Pioneering FPGA framework for binarized neural networks. Key reference for understanding how to map neural network operations to FPGA dataflows.
Shows ViT can be trained efficiently on ImageNet alone using knowledge distillation, making it practical without massive pre-training datasets.
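The distillation objective behind this (DeiT) can be sketched in a few lines. This is an illustrative soft-label KL loss with temperature scaling, not DeiT's exact recipe (DeiT's "hard" variant instead uses the teacher's argmax as an extra label via a distillation token); the logits here are random placeholders:

```python
import numpy as np

def softmax(z, T=1.0):
    # Temperature-scaled, numerically stable softmax.
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Placeholder logits from a CNN teacher and a ViT student on a batch of 4, 10 classes.
rng = np.random.default_rng(0)
teacher_logits = rng.standard_normal((4, 10))
student_logits = rng.standard_normal((4, 10))
T = 3.0  # temperature softens the teacher's distribution

p_t = softmax(teacher_logits, T)
p_s = softmax(student_logits, T)

# KL(teacher || student), scaled by T^2 per Hinton et al.; added to the usual
# cross-entropy against the ground-truth labels during training.
kl_loss = (p_t * (np.log(p_t) - np.log(p_s))).sum(axis=-1).mean() * T * T
print(kl_loss >= 0)
```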
Introduces shifted window attention for linear complexity, enabling ViT to scale to high-resolution images and dense prediction tasks.
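The linear complexity comes from restricting attention to fixed-size local windows and shifting the window grid between layers. A numpy sketch of the window-partition step, assuming Swin-T stage-1 sizes (56x56 feature map, 96 channels, 7x7 windows):

```python
import numpy as np

# Assumed sizes: 56x56 feature map, 96 channels, 7x7 windows (Swin-T stage 1).
H = W = 56; C = 96; M = 7  # M = window size

x = np.zeros((H, W, C))

def window_partition(x, M):
    """Split an (H, W, C) feature map into (num_windows, M*M, C) token groups."""
    H, W, C = x.shape
    x = x.reshape(H // M, M, W // M, M, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, M * M, C)

windows = window_partition(x, M)
print(windows.shape)  # (64, 49, 96): attention runs within each 49-token window

# Cyclically shifting the map by M//2 before partitioning lets the next layer's
# windows straddle the previous layer's boundaries ("shifted window" attention).
shifted = np.roll(x, shift=(-(M // 2), -(M // 2)), axis=(0, 1))
shifted_windows = window_partition(shifted, M)
```

Cost per window is O(M^4 C) regardless of image size, so total attention cost grows linearly with the number of windows rather than quadratically with the token count.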
Essential survey covering DNN hardware acceleration fundamentals: dataflows, memory hierarchies, and hardware architectures for efficient inference.
Comprehensive survey of neural network quantization techniques including post-training quantization, quantization-aware training, and mixed-precision methods.
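As a concrete reference point for the simplest technique in that survey, here is a minimal sketch of symmetric per-tensor post-training quantization to int8 (per-channel scales and asymmetric/affine schemes are common refinements):

```python
import numpy as np

def quantize_symmetric(w, n_bits=8):
    """Symmetric per-tensor post-training quantization to signed integers."""
    qmax = 2 ** (n_bits - 1) - 1           # 127 for int8
    scale = np.abs(w).max() / qmax         # one scale for the whole tensor
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
w = rng.standard_normal(1000).astype(np.float32)

q, scale = quantize_symmetric(w)
w_hat = q.astype(np.float32) * scale       # dequantize to inspect the error

# With no clipping, the per-element error is bounded by half a quantization step.
print(np.abs(w - w_hat).max() <= scale / 2 + 1e-6)
```

Quantization-aware training and mixed precision, which the survey also covers, trade more training effort for lower error at the same bit width.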
Practical walkthrough of deploying neural networks on Xilinx FPGAs using Vitis AI, covering quantization, compilation, and board deployment.
Official Vitis HLS documentation covering pragmas, directives, dataflow optimization, and best practices for high-performance HLS design.
Recent work mapping transformer-based LLMs to FPGAs with custom sparse computation and memory optimization. Highly relevant to our ViT acceleration work.
MIT course covering efficient deep learning: pruning, quantization, knowledge distillation, and hardware-aware neural architecture search.
Directly relevant: proposes an automated framework for ViT acceleration on FPGAs with mixed-precision quantization and hardware-aware NAS.
Professor Suggested Readings
A CNN accelerator that uses a dataflow pipeline (developed in Professor Betz's group; the latest in a series of related works). This accelerator style will be harder to adapt to a transformer, though.
The latest version of the NPU (also from Professor Betz's group; earlier publications are listed in its references). It is an overlay architecture, meaning a large, highly customized soft processor. This style is more likely to work well for a transformer.
Here's a (fairly advanced) lecture on vision transformers by Song Han at MIT. He has done a lot of interesting work on efficient machine learning.