🧠 When Lazy Kernels Hang - A Quirky Tale of CUDA, Streams, and Warmups

Summary:
Ever had your CUDA kernels mysteriously hang, even though everything looked fine? You’re not alone. This post walks through a deceptively simple code snippet that deadlocks — and explains how lazy loading, asynchronous streams, and cold GPUs all conspire to make benchmarking and debugging… interesting. We’ll break down what happens, why it matters, and how to keep your GPU pipelines warm and humming.


Upcoming (in no particular order)

  • Benchmarking setup
  • Concurrency
  • Matmul
    • Nsight Compute
  • Compute Sanitizer
  • Tensor cores
    • mma
    • wmma
    • wgmma + TMA
  • torch.compile
  • MoE: FlashDMoE
  • Reduction
  • Prefix Scan
  • FlashAttention + FlashMLA
  • Model Parallelism or Distributed training/inference
    • FSDP
    • Expert Parallelism
    • Context Parallelism
    • Sequence Parallelism
    • Pipeline Parallelism
    • 4D Parallelism

🔍 Know thy GPU - A Fun Dive into CUDA Device Introspection

Ever wondered what your GPU is made of? I don’t mean physically (though that would make a great teardown video) — I mean capability-wise. If you’re working with CUDA, it’s crucial to know whether your GPU supports managed memory, tensor cores, or concurrent kernel execution. And hey, maybe you’re just trying to settle a bet about whose card is faster. 🏎️

In this post, we’ll go on a quick and entertaining tour through a powerful C++ tool that queries all your CUDA-capable GPUs and tells you everything from warp size to peak memory bandwidth. Buckle up!
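
As a taste of the kind of number such a tool reports, here is a minimal sketch of the usual theoretical peak memory bandwidth calculation. It is not the post's actual code: `MemoryProps` and `peak_bandwidth_gbps` are hypothetical names, the two fields are assumed to mirror `memoryClockRate` (kHz) and `memoryBusWidth` (bits) as reported by the CUDA runtime, and the factor of 2 assumes double-data-rate memory.

```cpp
#include <cassert>

// Hypothetical stand-in for the two memory fields a device query would report.
// Assumed units: clock in kHz, bus width in bits.
struct MemoryProps {
    int memory_clock_khz;
    int memory_bus_width_bits;
};

// Theoretical peak bandwidth in GB/s (1 GB = 1e9 bytes).
// Assumes double data rate: two transfers per memory clock cycle.
double peak_bandwidth_gbps(const MemoryProps& props) {
    assert(props.memory_clock_khz > 0);
    assert(props.memory_bus_width_bits > 0);
    const double bytes_per_transfer = props.memory_bus_width_bits / 8.0;
    // kHz * bytes gives kilobytes per second; dividing by 1e6 converts to GB/s.
    return 2.0 * props.memory_clock_khz * bytes_per_transfer / 1.0e6;
}
```

For example, a card reporting a 7,000,000 kHz memory clock on a 256-bit bus works out to 2 × 7,000,000 × 32 / 1e6 = 448 GB/s under these assumptions. Measured bandwidth will be lower than this theoretical ceiling.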


Hello World

Welcome to Hexo! This is your very first post. Check documentation for more info. If you get any problems when using Hexo, you can find the answer in troubleshooting or you can ask me on GitHub.

Quick Start

Create a new post

$ hexo new "My New Post"

More info: Writing

Run server

$ hexo server

More info: Server

Generate static files

$ hexo generate

More info: Generating

Deploy to remote sites

$ hexo deploy

More info: Deployment