ann
spaceblog

GPU

2025-05-22•2 min read

Prerequisites

  1. C Programming Language, 2nd Edition
  2. GPU Computing
  3. Parallel Computing Stanford CS149
  4. GPU Programming by Simon Oz

CUDA

  1. CUDA C++ Programming Guide
  2. Parallel computing using C
  3. CUDA tutorial code samples
  4. CUDA book archive by NVIDIA
  5. UIUC CUDA course
  6. Programming in Parallel with CUDA (personal todo: ch6 & ch11)
  7. Optimize a CUDA Matmul Kernel for cuBLAS-like Performance
  8. Techniques from AMD $100K kernel competition:
    • ColorWinds Grand Prize winning kernel
    • Luong The Cong's FP8 Quant MatMul
    • Seb V's Fast GPU Matrix Implementation
    • Akash Karnatak's Challenge Solutions
    • Fan Wenjie's Technical Analysis
    • Snektron's FP8 Matrix Multiplication

Triton

  1. Triton docs
  2. k resources repo by remek
  3. Practitioner's guide to Triton
  4. Triton internals
  5. Reverse engineering Triton to CUDA

Misc

  1. ThunderKittens and starter guide
  2. TileLang
  3. GPU Glossary
  4. GPU goes brr (nice blog on GPU architecture)
  5. How to Accurately Time CUDA Kernels in PyTorch
  6. How CUDA programming works
  7. Outperforming cuBLAS on H100
  8. Memory Coalescing and Tiled Matrix Multiplication
  9. Tensor core programming
  10. CUDA MatMul

GitHub

  1. 100 days of building GPU kernels by hamdi
  2. 120 days of CUDA
  3. CUDA challenge by 1y33
  4. leetcuda