When I profile MPI+CUDA applications, performance issues sometimes occur only for certain MPI ranks. To fix these, it's necessary to identify the MPI rank where the performance issue occurs. Before CUDA 6.5 this was hard to do, because the CUDA profiler only shows the PID of the processes and leaves the developer to figure out the mapping from PIDs to MPI ranks. Although the mapping can be done manually, for example for Open MPI via the command-line option --display-map, it's tedious and error prone. A solution which solves this for the command-line output of nvprof is described here. In this post I will describe how the new output file naming of nvprof, introduced with CUDA 6.5, can be used to conveniently analyze the performance of an MPI+CUDA application with nvprof and the NVIDIA Visual Profiler (nvvp).

## Profiling MPI applications with nvprof and nvvp

### Collecting data with nvprof

nvprof supports dumping the profile to a file which can later be imported into nvvp. To generate a profile for an MPI+CUDA application, I simply start nvprof with the MPI launcher. Up to CUDA 6 I used the string "%p" in the output file name; nvprof automatically replaces that string with the PID and generates a separate file for each MPI rank. With CUDA 6.5, nvprof additionally accepts the string "%q{ENV}" as a parameter in the output file name, which it replaces with the value of the environment variable ENV. Since MPI launchers expose the rank in an environment variable, the output files can be named directly after the MPI rank.
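For example, assuming Open MPI (which exports the rank in OMPI_COMM_WORLD_RANK; MVAPICH2 uses MV2_COMM_WORLD_RANK) and a placeholder executable ./myapp, the collection step might look like this:

```sh
# Up to CUDA 6: one output file per process, distinguished only by PID
mpirun -np 2 nvprof -o profile.%p.nvprof ./myapp

# CUDA 6.5 and later: name the output file after the MPI rank;
# nvprof expands %q{OMPI_COMM_WORLD_RANK} to the value of that
# environment variable in each launched process
mpirun -np 2 nvprof -o profile.rank-%q{OMPI_COMM_WORLD_RANK}.nvprof ./myapp
```

The resulting files (profile.rank-0.nvprof, profile.rank-1.nvprof) can then be imported into nvvp for side-by-side analysis.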
Note that when profiling an application with NVIDIA Nsight Compute, the behavior is different: the user launches the NVIDIA Nsight Compute frontend (either the UI or the CLI) on the host system, which in turn starts the actual application as a new process on the target system. While host and target are often the same machine, the target can also be a remote machine.

To tell the processes in the resulting timeline apart, it helps to name the OS threads and CUDA resources after the MPI rank. Starting with CUDA 7.5, nvprof can do this directly via its --process-name and --context-name options, which also accept the "%q{ENV}" syntax. Before CUDA 7.5 you can achieve the same result by using NVTX explicitly from your application:

```cpp
#include <stdio.h>
#include <nvToolsExtCudaRt.h> // nvtxNameCudaDeviceA; link with -lnvToolsExt

// rank was obtained earlier via MPI_Comm_rank(MPI_COMM_WORLD, &rank);
// this assumes each rank uses the CUDA device whose ID equals its rank.
char name[256];
sprintf(name, "MPI Rank %d", rank);
nvtxNameCudaDeviceA(rank, name);
```

*NVVP timeline with named OS thread and CUDA device showing the GPU activity of two MPI processes.*
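A sketch of the CUDA 7.5 command-line route mentioned above, again assuming Open MPI and the placeholder executable ./myapp:

```sh
# CUDA 7.5 and later: name the process and CUDA context after the MPI rank
mpirun -np 2 nvprof --process-name "MPI Rank %q{OMPI_COMM_WORLD_RANK}" \
                    --context-name "MPI Rank %q{OMPI_COMM_WORLD_RANK}" \
                    -o profile.rank-%q{OMPI_COMM_WORLD_RANK}.nvprof ./myapp
```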
Instead of naming the CUDA devices, it's also possible to name the GPU context:

```cpp
#include <stdio.h>
#include <cuda.h>           // cuCtxGetCurrent
#include <nvToolsExtCuda.h> // nvtxNameCuContextA; link with -lnvToolsExt

// rank was obtained earlier via MPI_Comm_rank(MPI_COMM_WORLD, &rank)
CUcontext ctx;
cuCtxGetCurrent(&ctx);
char name[256];
sprintf(name, "MPI Rank %d", rank);
nvtxNameCuContextA(ctx, name);
```

*NVVP timeline with named OS thread and CUDA context showing the GPU activity of two MPI processes.*

To guarantee that cuCtxGetCurrent picks up the right context, a CUDA runtime call must be made between the calls to cudaSetDevice and cuCtxGetCurrent: the runtime initializes the device's primary context lazily, so without such a call there would be no current context for cuCtxGetCurrent to return.
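A minimal sketch of that required call order (local_rank is a hypothetical variable holding this rank's index among the ranks on the node):

```cpp
#include <cuda.h>
#include <cuda_runtime.h>

CUcontext get_rank_context(int local_rank)
{
    cudaSetDevice(local_rank); // select this rank's GPU
    cudaFree(0);               // any runtime call: lazily creates and makes
                               // current the primary context of that device
    CUcontext ctx = NULL;
    cuCtxGetCurrent(&ctx);     // now returns the primary context just created
    return ctx;
}
```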
To collect application traces and analyze the performance of MPI applications, well-established and much more sophisticated tools like Score-P, Vampir, or TAU exist. These tools use our profiling interface CUPTI to assess MPI+CUDA applications and also offer advanced support for detecting MPI- and CPU-related performance issues.

## Conclusion

Following the above approach, many performance issues of MPI+CUDA applications can be identified with NVIDIA tools, and NVTX can be used to improve working with these profiles. Besides the NVTX resource naming, everything described here works equally well with MPI+OpenACC applications.