Here is a collection of links that Yohann Dudouit provided to me when I began optimizing the Jacobian assembly kernel.

  • What is a Roofline Model? Click here to find out!
  • Some general optimization links:
  • You can look at the stall reasons:
  • You can measure them with nvprof using the following metrics:

    • stall_constant_memory_dependency: percentage of stalls occurring because of an immediate constant cache miss
    • stall_exec_dependency: percentage of stalls occurring because an input required by the instruction is not yet available
    • stall_inst_fetch: percentage of stalls occurring because the next assembly instruction has not yet been fetched
    • stall_memory_dependency: percentage of stalls occurring because a memory operation cannot be performed, either because the required resources are unavailable or fully utilized, or because too many requests of a given type are outstanding
    • stall_memory_throttle: percentage of stalls occurring because of memory throttle
    • stall_not_selected: percentage of stalls occurring because the warp was not selected
    • stall_other: percentage of stalls occurring due to miscellaneous reasons
    • stall_pipe_busy: percentage of stalls occurring because a compute operation cannot be performed while the compute pipeline is busy
    • stall_sync: percentage of stalls occurring because the warp is blocked at a __syncthreads() call
    • stall_texture: percentage of stalls occurring because the texture sub-system is fully utilized or has too many outstanding requests
  • Other resources:
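
The stall metrics listed above can be requested in a single nvprof run by passing them as a comma-separated list to its `--metrics` option. The following is a minimal sketch of how that command line could be assembled in Python. The helper `nvprof_command` and the executable name `./my_app` are hypothetical placeholders for illustration and are not part of the original notes.

```python
import shlex

# Stall metrics taken from the list above (names as reported by nvprof).
STALL_METRICS = [
    "stall_constant_memory_dependency",
    "stall_exec_dependency",
    "stall_inst_fetch",
    "stall_memory_dependency",
    "stall_memory_throttle",
    "stall_not_selected",
    "stall_other",
    "stall_pipe_busy",
    "stall_sync",
    "stall_texture",
]


def nvprof_command(app, metrics=STALL_METRICS):
    """Build the argument list for one nvprof run collecting `metrics`.

    `app` is the application command as a list, e.g. ["./my_app", "-n", "4"].
    This is a hypothetical helper; it only assembles the command line.
    """
    return ["nvprof", "--metrics", ",".join(metrics), *app]


if __name__ == "__main__":
    # "./my_app" is an assumed executable name; substitute your own binary.
    cmd = nvprof_command(["./my_app"])
    # Print the command so it can be inspected or pasted into a shell.
    # To launch it directly instead: subprocess.run(cmd, check=True)
    print(shlex.join(cmd))
```

Passing a shorter list, for example `nvprof_command(["./my_app"], ["stall_sync", "stall_memory_dependency"])`, keeps the run faster when only a few stall reasons are of interest, since collecting many metrics may require nvprof to replay kernels.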

The MFEM team recently collected many tips for optimizing GPU kernels in this post.

Last updated: 4/11/2021