Brainstorm Board
Capture and vote on research ideas, hypotheses, and design directions.
Use INT4 mixed-precision quantization
Apply INT4 to weights and INT8 to activations to reduce model size by 2x while keeping accuracy within 2% of the FP32 baseline.
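A minimal software sketch of the symmetric per-tensor quantizer this card implies — the same routine covers INT4 weights (range [-8, 7]) and INT8 activations (range [-128, 127]) by varying the bit width. The scaling scheme and storage layout are illustrative assumptions, not a fixed design decision:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Symmetric quantization: value ≈ q * scale, with q clamped to the
// signed range of the chosen bit width (INT4 for weights, INT8 for
// activations). INT4 values are stored one per int8_t here for clarity;
// hardware would pack two per byte.
struct QTensor {
    std::vector<int8_t> q;  // quantized values
    float scale;            // dequantize as q * scale
};

QTensor quantize(const std::vector<float>& x, int bits) {
    const int qmax = (1 << (bits - 1)) - 1;   //  7 for INT4,  127 for INT8
    const int qmin = -(1 << (bits - 1));      // -8 for INT4, -128 for INT8
    float max_abs = 0.f;
    for (float v : x) max_abs = std::max(max_abs, std::fabs(v));
    const float scale = (max_abs > 0.f) ? max_abs / qmax : 1.f;
    QTensor t{{}, scale};
    t.q.reserve(x.size());
    for (float v : x) {
        int q = static_cast<int>(std::lround(v / scale));
        t.q.push_back(static_cast<int8_t>(std::clamp(q, qmin, qmax)));
    }
    return t;
}
```

Per-channel scales or an asymmetric zero point would tighten the 2% accuracy budget further; this sketch shows only the core round-and-clamp step.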
Tile-based attention computation
Partition the attention matrix into tiles that fit in on-chip BRAM to avoid expensive DRAM accesses during the softmax computation.
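Since softmax normally needs a full score row before normalizing, tiling it requires a running-max/running-sum reformulation (the "online softmax" trick) so that only one K/V tile must be resident in BRAM at a time. A software sketch for a single query row, with illustrative shapes:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Attention for one query over K/V processed tile by tile. The running
// max m, running sum s, and accumulator acc are rescaled whenever a new
// tile raises the max, so no score row is ever fully materialized —
// only one tile of K and V needs to be on-chip at a time.
std::vector<float> tiled_attention_row(
    const std::vector<float>& q,               // [d]
    const std::vector<std::vector<float>>& K,  // [n][d]
    const std::vector<std::vector<float>>& V,  // [n][d]
    std::size_t tile) {
    const std::size_t n = K.size(), d = q.size();
    float m = -INFINITY, s = 0.f;
    std::vector<float> acc(d, 0.f);
    for (std::size_t t0 = 0; t0 < n; t0 += tile) {
        const std::size_t t1 = std::min(t0 + tile, n);
        for (std::size_t j = t0; j < t1; ++j) {      // one BRAM-resident tile
            float score = 0.f;
            for (std::size_t k = 0; k < d; ++k) score += q[k] * K[j][k];
            score /= std::sqrt(static_cast<float>(d));
            const float m_new = std::max(m, score);
            const float corr  = std::exp(m - m_new); // rescale old partials
            const float p     = std::exp(score - m_new);
            s = s * corr + p;
            for (std::size_t k = 0; k < d; ++k)
                acc[k] = acc[k] * corr + p * V[j][k];
            m = m_new;
        }
    }
    for (std::size_t k = 0; k < d; ++k) acc[k] /= s;
    return acc;
}
```

The result is bit-for-bit equivalent (up to float rounding) to computing the full softmax row, while the DRAM traffic per tile is bounded by the tile size.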
Explore Swin Transformer for local attention
Swin's window-based attention has O(n) complexity vs O(n²) for standard ViT. Could significantly reduce hardware resource requirements.
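The complexity claim can be made concrete by counting score pairs: full self-attention over n tokens computes n² query-key pairs, while attention restricted to fixed-size windows of w tokens computes roughly n·w. A small sketch (1-D token partitioning for simplicity; Swin windows are 2-D):

```cpp
#include <algorithm>

// Full self-attention: every token attends to every token → n^2 pairs.
long long full_pairs(long long n) { return n * n; }

// Window attention: tokens attend only within their window of size w,
// so the pair count grows linearly in n (≈ n * w). The last window may
// be smaller if w does not divide n.
long long window_pairs(long long n, long long w) {
    long long pairs = 0;
    for (long long start = 0; start < n; start += w) {
        long long sz = std::min(w, n - start);
        pairs += sz * sz;
    }
    return pairs;
}
```

For n = 64 tokens and w = 8, this is 4096 pairs vs 512 — and doubling n doubles the windowed count but quadruples the full count, which is the resource argument for Swin on an FPGA.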
Pipelined HLS design for FFN layers
Use the HLS PIPELINE pragma with II=1 (an initiation interval of one clock cycle) to fully pipeline the feed-forward network layers, maximizing throughput.
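A sketch of the loop structure this card suggests, with illustrative dimensions. Pipelining the accumulation loop directly would be limited by the floating-point add dependency, so the usual pattern is to PIPELINE the output loop at II=1 and fully UNROLL the inner dot product (the pragmas are no-ops under a plain C++ compiler, so the block also runs as software):

```cpp
// One FFN layer as matvec + bias + ReLU. With the outer loop pipelined
// at II=1 and the inner loop unrolled, Vitis HLS can start one output
// neuron per clock cycle (assuming w and x are partitioned into BRAM/
// registers to feed the unrolled multipliers). Sizes are placeholders.
constexpr int IN = 64, OUT = 64;

void ffn_layer(const float w[OUT][IN], const float b[OUT],
               const float x[IN], float y[OUT]) {
    for (int o = 0; o < OUT; ++o) {
#pragma HLS PIPELINE II=1
        float acc = b[o];
        for (int i = 0; i < IN; ++i) {
#pragma HLS UNROLL
            acc += w[o][i] * x[i];
        }
        y[o] = (acc > 0.f) ? acc : 0.f;  // ReLU
    }
}
```

Full unrolling costs IN multipliers per layer instance; a partial unroll factor trades II against DSP usage if the budget is tight.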
Double-buffering for weight loading
Pre-fetch the next layer's weights while computing the current layer to hide DRAM latency.
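The control flow is a ping-pong (double) buffer: compute reads one buffer while the next layer's weights stream into the other, then the roles swap. This software sketch shows only the buffer-swapping logic — in HLS the load and compute would actually overlap (e.g. as separate functions under a DATAFLOW region); names and signatures are illustrative:

```cpp
#include <cstddef>
#include <vector>

// Run a sequence of layers with two weight buffers. buf[cur] feeds the
// current layer's compute while buf[1 - cur] is filled with the next
// layer's weights (standing in for a DRAM burst read); then the buffers
// swap. Sequential here; concurrent in hardware, hiding the DRAM latency.
void run_layers(const std::vector<std::vector<float>>& dram_weights,
                std::vector<float>& activations,
                void (*compute)(const std::vector<float>&, std::vector<float>&)) {
    std::vector<float> buf[2];
    int cur = 0;
    buf[cur] = dram_weights[0];                  // prefetch the first layer
    for (std::size_t l = 0; l < dram_weights.size(); ++l) {
        const int nxt = 1 - cur;
        if (l + 1 < dram_weights.size())
            buf[nxt] = dram_weights[l + 1];      // load next layer's weights
        compute(buf[cur], activations);          // compute current layer
        cur = nxt;                               // swap ping-pong buffers
    }
}
```

The latency win holds as long as one layer's compute time covers one layer's weight transfer; otherwise the design is bandwidth-bound regardless of buffering.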
Compare ZCU104 vs Alveo U250 targets
Evaluate whether the embedded ZCU104 or the datacenter Alveo U250 better fits our latency/power budget.