27 | | Currently GUARDYAN runs on a machine containing two Nvidia GeForce GTX 1080 cards, eachwith 8 GBytes of global memory and 5500 GFLOP/s single precision performance accordingto NBody GPU benchmark. The GTX 1080 cards are based on the Pascal architecture and have2560 scalar working units (CUDA cores). These cores can launch warps of 32 concurrent threads,resulting in a theoretical maximum of 81920 parallel working units. The optimal number of concurrentthreads may of course differ due to memory and arithmetic latency considerations. Threadmanagement is implemented by organizing a desired number of threads into blocks, which are requiredto execute independently. This also ensures automatic scalability of the program, as blocksof threads can be scheduled on any multiprocessors of the device, yielding faster execution timewhen more multiprocessors are available. Functions executed in parallel are called kernels in CUDAterminology. Kernels are launched by specifying the number of threads in a block, and the totalnumber of blocks. In general, to choose the number of threads in a block as a multiple of warp size(32) is a good idea, however, CUDA offers an opportunity to maximize kernel performance automatically: calling the cudaOccupancyMaxPotentialBlockSize function for every kernel ensuresoptimal occupancy in terms of arithmetic intensity and memory latency. |
| 27 | Currently GUARDYAN runs on a machine containing two Nvidia GeForce GTX 1080 cards, each with 8 GBytes of global memory and 5500 GFLOP/s single precision performance accordingto NBody GPU benchmark. The GTX 1080 cards are based on the Pascal architecture and have 2560 scalar working units (CUDA cores). These cores can launch warps of 32 concurrent threads,resulting in a theoretical maximum of 81920 parallel working units. The optimal number of concurrent threads may of course differ due to memory and arithmetic latency considerations. Thread management is implemented by organizing a desired number of threads into blocks, which are required to execute independently. This also ensures automatic scalability of the program, as blocks of threads can be scheduled on any multiprocessors of the device, yielding faster execution time when more multiprocessors are available. Functions executed in parallel are called kernels in CUDA terminology. Kernels are launched by specifying the number of threads in a block, and the total number of blocks. In general, to choose the number of threads in a block as a multiple of warp size(32) is a good idea, however, CUDA offers an opportunity to maximize kernel performance automatically: calling the ''cudaOccupancyMaxPotentialBlockSize'' function for every kernel ensures optimal occupancy in terms of arithmetic intensity and memory latency. |
31 | | CUDA distinguishes six memory types: register, local, shared, texture, constant and global memory.Registers ensure the fastest memory access and are assigned to each thread. Global, constant and texture memory can be accessed by all threads, while the scope of shared memory is only ablock. In exchange, it is much faster. Texture memory is not truly a distinct memory type, it onlylabels a part of global memory that is bound to texture. Textures are implemented with hardwareinterpolation, thus they would be ideal for storing cross section data. But due to random memoryaccess patterns inherent in MC simulations, using cached memory is not advised in this case ,thus cross sections are stored in global memory. A severe limitation for MC applications is the sizeof global memory, e.g. in GUARDYAN, nuclear data for one temperature occupies about 6GB ona card with global capacity of 8GB. Memory transactions between GPU (device) and CPU (host) are carried outthrough reading and writing global memory. As the access of global memory is slow, these transactionscan also take a considerable time, and can have a significant impact on overall performance.Notice, that if the simulation structure is changed (e.g. a history-based algorithm is vectorized),register use, global memory reads and host-device communication will show different behavior,also influencing runtime. Thus the performance gain from vectorization will be obscured. |
| 31 | CUDA distinguishes six memory types: register, local, shared, texture, constant and global memory.Registers ensure the fastest memory access and are assigned to each thread. Global, constant and texture memory can be accessed by all threads, while the scope of shared memory is only a block. In exchange, it is much faster. Texture memory is not truly a distinct memory type, it only labels a part of global memory that is bound to texture. Textures are implemented with hardware interpolation, thus they would be ideal for storing cross section data. But due to random memory access patterns inherent in MC simulations, using cached memory is not advised in this case ,thus cross sections are stored in global memory. A severe limitation for MC applications is the sizeof global memory, e.g. in GUARDYAN, nuclear data for one temperature occupies about 6GB ona card with global capacity of 8GB. Memory transactions between GPU (device) and CPU (host) are carried out through reading and writing global memory. As the access of global memory is slow, these transactions can also take a considerable time, and can have a significant impact on overall performance.Notice, that if the simulation structure is changed (e.g. a history-based algorithm is vectorized),register use, global memory reads and host-device communication will show different behavior,also influencing runtime. Thus the performance gain from vectorization will be obscured. |
| 32 | |
| 33 | === Thread Divergence |
| 34 | |
| 35 | When implementing a MC neutron transport code special attention must be paid to choosing the right kernels, as the slowest working thread will determine the efficiency of parallelization. All other threads in a warp must wait for the thread finishing last. Loops, conditional and branching statements lead to thread divergence, an uneven distribution of work-load. This issue is targeted by the vectorization of the code, i.e. the event-based Monte Carlo simulation. In event-based GUARDYAN this is implemented by distinguishing kernel functions for different types of events instead of just one ”big” kernel (as in the history-based GUARDYAN). When calling these kernels,threads of a warp executing the same operations on particles (the same event is simulated) do not branch, whereas in a history-based simulation threads may easily diverge as one particle may be in a transition step while the other scatters or induces fission. Branching statements in a warp are executed serially in CUDA: an if-else statement is executed for all warps in two cycles (both branch is executed one after the other). When the ”if” branch is executed, the threads that do not satisfy the condition (would diverge to the ”else” branch) are flagged and perform a NOP (no operation).This results in the degradation of parallel performance. However, it does not necessarily mean that the vectorized version of the code will execute faster. One reason was given in the previous section, regarding memory management issues. Neither should we neglect that the compiler also does some optimization to reduce the penalty due to thread divergence [13], among these are wellknown tools like warp-voting and predication, but CUDA uses optimization tools that are hidden from the programmer. |
| 38 | == Event based and History based approach comparison: Test Case '''1''' |
| 39 | GUARDYAN has been recently validated against MCNP6 in a simplified setup assuming a monoenergetic neutron source inside a homogeneous sphere. 4 10^6^ neutrons were launched at energies 0.01eV , 1eV , 1keV , 1MeV and 18MeV , and tracked until either leaking out of the sphere or exceeding time boundary. The simulation was carried out for 412 isotopes and was used for validation of the code comparing the spectrum of leaking neutrons to MCNP6 results. Cross section library ENDF/B-VII.1 was used assuming temperature of 293:6K. Regarding our investigation of event-based tracking, wall-time was measured for both history-based (TH) and event-based (TE) simulations. In Fig. 2, histograms of simulation speedup are plotted for all starting energies. Speedup is simply defined by |
| 40 | |
| 41 | Speedup= TE/TH |
| 42 | |
| 43 | i.e. the ratio of wall-times. Fig. 2 shows that vectorization of the code resulted in faster execution time in most cases. Typical speedup was around 1:5-2, but longer simulation time was observed mainly when starting energy is below 1MeV . |
| 44 | |
| 45 | |
| 46 | The efficiency loss was experienced in case of isotopes with high probability for fission around the starting energy. When the starting energy is low, neutrons released in fission take on much higher velocity than starters, thus leaking out of the system very fast. As a result, significant part of computational effort was spent on a few neutrons bouncing around in the system. Population drop caused vectorization gain to be cancelled due to the computational overhead of event-based tracking (particles need to be sorted by event type). On higher starting energies, no considerable speedup was observed in case of elements with low atomic numbers, the improvement from vectorization was more expressed when heavy elements were present. This is due to that the outgoing energy and angle of a neutron scattered on a light isotope are derived by simple laws of collision mechanics, while more complicated energy laws are applied when heavier isotopes are present . In GUARDYAN beside elastic scatter only ACE law 3 (inelastic discrete-level scattering) was used in the former case, and ACE law 4 (and 44) was additionally used in the latter. ACE law 4 represents a continuous tabular distribution, the outgoing energy is given as a probability distribution for every incoming energy. This sampling procedure takes considerably more time, contributing to thread divergence, and resulting in substantial efficiency boost for event-based tracking. |