We have established two independent branches of the code, both designed to exploit the parallel capabilities of the GPU, albeit via quite different strategies. The first branch takes a very straightforward approach, namely the concurrent simulation of particle histories: one working unit is designated to simulate a particle history from birth to death, following the traditional history-based structure. This idea exploits the inherent parallelism of MC tracking, i.e. particle histories are independent of each other. In the other branch the code is vectorized, meaning that simultaneously executed operations are expected to be identical. While vectorizing the code would pose little difficulty for deterministic methods, MC simulations are ill-suited to it due to the random nature of the process. Preserving a history-based structure is infeasible in this case, thus an event-based strategy was implemented. A parallel MC calculation is called event-based when only particles undergoing the same event are simulated concurrently. In this case, one working unit is assigned to calculate the outcome of one event in a particle history. The tracking routine first assigns events (e.g. free flight, fission, elastic scatter) to particles, creating stacks of particles undergoing the same event; these stacks are then processed separately, as sketched below.
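To make the event-based bookkeeping concrete, the following is a minimal CUDA sketch of the stack-building step. The `Particle` record, the event codes, and the kernel name `buildEventStacks` are illustrative assumptions, not GUARDYAN's actual data structures: each thread reads the next event of one particle and files the particle's index into the stack of that event.

[source,cuda]
----
#include <vector>
#include <cuda_runtime.h>

// Hypothetical event codes and particle record, standing in for
// GUARDYAN's actual data structures (assumed for illustration).
enum Event { FREE_FLIGHT = 0, ELASTIC = 1, FISSION = 2, NUM_EVENTS = 3 };

struct Particle { float energy; int event; };

// Each thread files the index of its particle into the stack of the
// event that particle undergoes next. counts[e] is the current size
// of stack e; atomicAdd reserves a unique slot for this thread.
__global__ void buildEventStacks(const Particle* particles, int n,
                                 int* stacks, int* counts, int capacity)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    int e = particles[i].event;
    int slot = atomicAdd(&counts[e], 1);
    stacks[e * capacity + slot] = i;  // stack e occupies one row
}

int main()
{
    const int n = 1 << 20;
    std::vector<Particle> h(n);
    for (int i = 0; i < n; ++i)
        h[i] = Particle{ 1.0f, i % NUM_EVENTS };  // dummy event assignment

    Particle* d_particles;
    int *d_stacks, *d_counts;
    cudaMalloc(&d_particles, n * sizeof(Particle));
    cudaMalloc(&d_stacks, NUM_EVENTS * n * sizeof(int));
    cudaMalloc(&d_counts, NUM_EVENTS * sizeof(int));
    cudaMemcpy(d_particles, h.data(), n * sizeof(Particle),
               cudaMemcpyHostToDevice);
    cudaMemset(d_counts, 0, NUM_EVENTS * sizeof(int));

    int block = 256, grid = (n + block - 1) / block;
    buildEventStacks<<<grid, block>>>(d_particles, n, d_stacks, d_counts, n);
    cudaDeviceSynchronize();

    // One kernel per event would then process its own stack, so all
    // threads of a given launch execute identical physics.
    cudaFree(d_particles); cudaFree(d_stacks); cudaFree(d_counts);
    return 0;
}
----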

=== Parallel optimization structures

Currently GUARDYAN runs on a machine containing two Nvidia GeForce GTX 1080 cards, each with 8 GB of global memory and 5500 GFLOP/s single-precision performance according to the NBody GPU benchmark. The GTX 1080 cards are based on the Pascal architecture and have 2560 scalar working units (CUDA cores). These cores can launch warps of 32 concurrent threads, resulting in a theoretical maximum of 81920 parallel working units. The optimal number of concurrent threads may of course differ due to memory and arithmetic latency considerations. Thread management is implemented by organizing a desired number of threads into blocks, which are required to execute independently. This also ensures automatic scalability of the program, as blocks of threads can be scheduled on any multiprocessor of the device, yielding faster execution when more multiprocessors are available. Functions executed in parallel are called kernels in CUDA terminology. Kernels are launched by specifying the number of threads in a block and the total number of blocks. In general, choosing the number of threads in a block as a multiple of the warp size (32) is a good idea; however, CUDA also offers a way to maximize kernel performance automatically: calling the cudaOccupancyMaxPotentialBlockSize function for every kernel ensures optimal occupancy in terms of arithmetic intensity and memory latency.
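As a sketch of this launch configuration, the snippet below asks the runtime for the occupancy-maximizing block size of a kernel and rounds the grid up to cover all particles. The kernel `trackHistories` is a hypothetical stand-in for a GUARDYAN tracking kernel, not code from the actual program.

[source,cuda]
----
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for a tracking kernel: one thread per particle.
__global__ void trackHistories(float* energies, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        energies[i] *= 0.5f;  // placeholder for one history step
}

int main()
{
    const int n = 1 << 20;
    float* d_energies;
    cudaMalloc(&d_energies, n * sizeof(float));

    // Ask the runtime for the occupancy-maximizing block size of this
    // particular kernel (0 bytes of dynamic shared memory, no size limit).
    int minGridSize = 0, blockSize = 0;
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize,
                                       trackHistories, 0, 0);

    // Round the grid up so every particle is covered by a thread.
    int gridSize = (n + blockSize - 1) / blockSize;
    trackHistories<<<gridSize, blockSize>>>(d_energies, n);
    cudaDeviceSynchronize();

    printf("suggested block size: %d\n", blockSize);
    cudaFree(d_energies);
    return 0;
}
----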

=== Memory Management

CUDA distinguishes six memory types: register, local, shared, texture, constant, and global memory. Registers ensure the fastest memory access and are assigned to each thread. Global, constant, and texture memory can be accessed by all threads, while the scope of shared memory is only a block; in exchange, it is much faster. Texture memory is not truly a distinct memory type: it only labels a part of global memory that is bound to a texture. Textures are implemented with hardware interpolation, so they would be ideal for storing cross section data; but due to the random memory access patterns inherent in MC simulations, using cached memory is not advised in this case, thus cross sections are stored in global memory. A severe limitation for MC applications is the size of global memory: in GUARDYAN, for example, nuclear data for one temperature occupies about 6 GB on a card with a global capacity of 8 GB. Memory transactions between the GPU (device) and the CPU (host) are carried out through reading and writing global memory. As access to global memory is slow, these transactions can take considerable time and can have a significant impact on overall performance. Note that if the simulation structure is changed (e.g. a history-based algorithm is vectorized), register use, global memory reads, and host-device communication will all behave differently, also influencing runtime; thus the performance gain from vectorization itself will be obscured.
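The pattern below is a minimal sketch of such a bulk transfer; the flat `h_xs` table and its size are assumptions for illustration, as a real nuclear-data layout (per-nuclide energy grids, reaction channels) is considerably more involved. The point is that the cross-section table is copied into global memory once at startup, so the slow host-device transfer is not paid during tracking.

[source,cuda]
----
#include <vector>
#include <cuda_runtime.h>

int main()
{
    // Hypothetical flat cross-section table (illustrative size only).
    const size_t nPoints = 1 << 24;
    std::vector<float> h_xs(nPoints, 1.0f);

    float* d_xs = nullptr;
    if (cudaMalloc(&d_xs, nPoints * sizeof(float)) != cudaSuccess)
        return 1;  // e.g. the table does not fit into global memory

    // Single bulk host-to-device copy, performed once at startup.
    cudaMemcpy(d_xs, h_xs.data(), nPoints * sizeof(float),
               cudaMemcpyHostToDevice);

    // ... kernels subsequently read d_xs straight from global memory ...

    cudaFree(d_xs);
    return 0;
}
----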