Collection of CUDA Kernel Optimization Links
Here is a collection of links that Yohann Dudouit provided to me when I began optimizing the Jacobian assembly kernel.
- What is a Roofline Model? Click here to find out!
- Some general optimization links:
- You can look at the stall reasons:
- You can use nvprof with:
  - stall_constant_memory_dependency: Percentage of stalls occurring because of immediate constant cache miss
  - stall_exec_dependency: Percentage of stalls occurring because an input required by the instruction is not yet available
  - stall_inst_fetch: Percentage of stalls occurring because the next assembly instruction has not yet been fetched
  - stall_memory_dependency: Percentage of stalls occurring because a memory operation cannot be performed due to the required resources not being available or fully utilized, or because too many requests of a given type are outstanding
  - stall_memory_throttle: Percentage of stalls occurring because of memory throttle
  - stall_not_selected: Percentage of stalls occurring because warp was not selected
  - stall_other: Percentage of stalls occurring due to miscellaneous reasons
  - stall_pipe_busy: Percentage of stalls occurring because a compute operation cannot be performed because the compute pipeline is busy
  - stall_sync: Percentage of stalls occurring because the warp is blocked at a __syncthreads() call
  - stall_texture: Percentage of stalls occurring because the texture sub-system is fully utilized or has too many outstanding requests
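As a rough illustration of how the stall metrics above might be requested in one run, here is a minimal Python sketch that assembles an nvprof command line using its --metrics option, which takes a comma-separated list. The helper name build_nvprof_command and the executable ./my_app are hypothetical placeholders, not part of the original notes. The script only prints the command and does not run it.

```python
# Stall metrics from the list above, passed to nvprof via --metrics.
STALL_METRICS = [
    "stall_constant_memory_dependency",
    "stall_exec_dependency",
    "stall_inst_fetch",
    "stall_memory_dependency",
    "stall_memory_throttle",
    "stall_not_selected",
    "stall_other",
    "stall_pipe_busy",
    "stall_sync",
    "stall_texture",
]


def build_nvprof_command(app_args, metrics=STALL_METRICS):
    """Return the argv list that profiles `app_args` with the given metrics.

    `app_args` is the application command line as a list of strings.
    This helper is hypothetical; it only formats the command.
    """
    return ["nvprof", "--metrics", ",".join(metrics), *app_args]


if __name__ == "__main__":
    # "./my_app" is a placeholder; replace it with your own CUDA executable.
    command = build_nvprof_command(["./my_app"])
    print(" ".join(command))
```

Passing a shorter list, for example only stall_sync and stall_memory_dependency, keeps the profiling overhead down when you already suspect a particular kind of stall.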
- Other resources:
The MFEM team has recently written a nice little document containing many tips for optimizing GPU kernels in this post.
Last updated: 4/11/2021