Collection of CUDA Kernel Optimization Links

Aug 17, 2020

Here is a collection of links that Yohann Dudouit provided to me when I began optimizing the Jacobian assembly kernel.

What is a Roofline Model? Click here to find out!
Some general optimization links:
You can look at the stall reasons:
You can use nvprof with the following metrics, passed through its --metrics option:
- stall_constant_memory_dependency: percentage of stalls occurring because of an immediate constant cache miss
- stall_exec_dependency: percentage of stalls occurring because an input required by the instruction is not yet available
- stall_inst_fetch: percentage of stalls occurring because the next assembly instruction has not yet been fetched
- stall_memory_dependency: percentage of stalls occurring because a memory operation cannot be performed, either because the required resources are unavailable or fully utilized, or because too many requests of a given type are outstanding
- stall_memory_throttle: percentage of stalls occurring because of memory throttling
- stall_not_selected: percentage of stalls occurring because the warp was not selected
- stall_other: percentage of stalls occurring for miscellaneous reasons
- stall_pipe_busy: percentage of stalls occurring because a compute operation cannot be performed while the compute pipeline is busy
- stall_sync: percentage of stalls occurring because the warp is blocked at a __syncthreads() call
- stall_texture: percentage of stalls occurring because the texture sub-system is fully utilized or has too many outstanding requests
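As a minimal sketch of how the metrics above might be collected in one run, the snippet below joins them into the comma-separated list that nvprof's --metrics option expects. The binary name ./my_app is a placeholder and not something from the original links; substitute your own executable. The command is only printed here so that it can be checked first.

```shell
#!/bin/sh
# Hypothetical application binary (an assumption); replace with your own.
APP=./my_app

# Build a comma-separated list of the ten stall metrics listed above.
METRICS="stall_constant_memory_dependency"
for m in stall_exec_dependency stall_inst_fetch stall_memory_dependency \
         stall_memory_throttle stall_not_selected stall_other \
         stall_pipe_busy stall_sync stall_texture; do
    METRICS="$METRICS,$m"
done

# Print the full command for inspection; drop the leading 'echo' to run it.
echo nvprof --metrics "$METRICS" "$APP"
```

Collecting every stall metric at once makes nvprof replay each kernel several times, so the profiled run will be noticeably slower than a normal one. If only one stall reason is of interest, passing that single metric name keeps the overhead down.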
Other resources:

The MFEM team has recently written a nice little document containing many tips for optimizing GPU kernels in this post.

Last updated: 4/11/2021