One group of models in this category is the suite of fully multidimensional, flux-corrected transport (FCT) codes. These models exploit the positivity- and monotonicity-preserving properties of FCT to obtain accurate solutions of the compressible, ideal MHD equations. During the preceding HPCC cycle, we developed data-parallel implementations of our latest multidimensional FCT algorithms for hydrodynamics in 2D and 3D, and for MHD in 2D and 2.5D. These codes were developed on the Thinking Machines CM-200 and CM-5 at NRL. Some experimentation was required to optimize the implementation of aperiodic boundary conditions; the most efficient approach was to employ array constructs with appropriate masks, thereby exploiting the CM's very fast circular shift system routine. The resulting CM Fortran code attained a peak speed of just under 4 GFlops on a 256-node CM-5, a state-of-the-art figure based on the experience of other CM users at NRL and elsewhere. In addition, we demonstrated the scalability of our implementation with both problem size and the size of the CM partition. Both results are illustrated in Fig. 1, and together they indicate that we could achieve 15 GFlops sustained performance on a 1024-node CM-5 today.
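As a concrete illustration of this masking strategy, the following minimal NumPy sketch applies a periodic circular shift everywhere and then uses a logical mask to overwrite the boundary cells with an aperiodic (here, fixed-inflow) value. The production code was CM Fortran built on CSHIFT and masked array assignments; the function and parameter names below are purely illustrative.

    import numpy as np

    def advect_step(rho, v, dx, dt, rho_bc):
        """One upwind advection update (v > 0) with an aperiodic boundary
        imposed by a mask; rho_bc is a hypothetical fixed inflow value."""
        # Circular shift fetches the upwind neighbor everywhere at once
        # (the analogue of the CM's fast circular-shift system routine).
        upwind = np.roll(rho, 1)

        # Provisional update as if the domain were periodic.
        new = rho - v * dt / dx * (rho - upwind)

        # Mask selecting interior points; the inflow cell is overwritten,
        # imposing the aperiodic condition without irregular indexing.
        interior = np.ones_like(rho, dtype=bool)
        interior[0] = False
        return np.where(interior, new, rho_bc)

    # Illustrative use with assumed parameters.
    rho = np.ones(64)
    rho[:8] = 2.0
    rho = advect_step(rho, v=1.0, dx=1.0, dt=0.5, rho_bc=1.0)

The point is that the update remains a single, uniform, data-parallel array operation; no special-case loops over boundary cells are required.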
A second useful point of reference is that our highly parallel, but somewhat communications intensive, CM-5 code running on 32 nodes is about as fast as our highly vectorized, companion Cray code on one C90 processor. Thus, we obtain a speedup of about a factor of 8 - nearly an order of magnitude - on NRL's 256-node CM-5 relative to the C90. This result, together with comparable increases for a 3D relaxation code for force-free magnetic fields and a 3D spectral MHD model, is shown in Fig. 2.
Our FCT modules for solving generalized continuity and hydromagnetic equations in 2D, together with test programs illustrating their use, have been submitted to the HPCC/ESS software exchange in versions for the Cray C90, Thinking Machines CM-200/CM-5, and Cray T3D. This software has been accessed by scientists around the world, and two informal collaborations have been initiated.
Another aspect of our HPCC research is FCT algorithm development. Recently we proposed a new prescription for flux limiting in multidimensional FCT models, which improves the performance of the algorithms for both hydrodynamical and MHD systems. The resulting suppression of numerical ripples significantly clarifies the small-scale dynamical structures of physical origin. A second area in which we have pursued new approaches is that of slow-flow MHD systems. Explicit integration schemes such as our FCT algorithms are subject to a numerical timestep limitation determined by the magnetoacoustic speed. In many solar phenomena, however, the important physical timescales are set by the much slower flows, e.g., the shuffling of magnetic footpoints by photospheric convection. To relax this timestep constraint, we treat implicitly only the Alfvenic transport terms. This extends to MHD our successful barely implicit correction (BIC) approach to slow flows in hydrodynamics.
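The new multidimensional limiter itself is not reproduced here, but the following NumPy sketch of a standard Zalesak-type flux-correction step in 1D (periodic, with illustrative names) shows the operation that such prescriptions refine: the antidiffusive fluxes are clipped so that the corrected solution cannot exceed the local extrema of the low-order solution.

    import numpy as np

    def fct_limit(rho_td, A):
        """Zalesak-type flux limiter in 1D with periodic wrap (illustrative).

        rho_td : low-order (transported and diffused) solution, shape (n,)
        A      : antidiffusive flux across the interface between cell i and
                 cell i+1, normalized so that the corrected solution is
                 rho[i] = rho_td[i] - (C[i]*A[i] - C[i-1]*A[i-1]).
        """
        # Local bounds that the corrected solution must respect.
        w_max = np.maximum.reduce([np.roll(rho_td, 1), rho_td, np.roll(rho_td, -1)])
        w_min = np.minimum.reduce([np.roll(rho_td, 1), rho_td, np.roll(rho_td, -1)])

        # Total antidiffusive flux into (P+) and out of (P-) each cell.
        A_left = np.roll(A, 1)   # flux across the left face of cell i
        P_plus  = np.maximum(A_left, 0.0) - np.minimum(A, 0.0)
        P_minus = np.maximum(A, 0.0) - np.minimum(A_left, 0.0)

        # Room available before a cell overshoots (Q+) or undershoots (Q-).
        Q_plus  = w_max - rho_td
        Q_minus = rho_td - w_min

        with np.errstate(divide='ignore', invalid='ignore'):
            R_plus  = np.where(P_plus  > 0.0, np.minimum(1.0, Q_plus  / P_plus), 0.0)
            R_minus = np.where(P_minus > 0.0, np.minimum(1.0, Q_minus / P_minus), 0.0)

        # Each interface flux is limited by the receiving cell's R+ and the
        # donating cell's R-.
        C = np.where(A >= 0.0,
                     np.minimum(np.roll(R_plus, -1), R_minus),
                     np.minimum(R_plus, np.roll(R_minus, -1)))

        limited = C * A
        return rho_td - (limited - np.roll(limited, 1))

Improved multidimensional prescriptions differ mainly in how the bounds and limiting coefficients are constructed, which is where spurious ripples either survive or are suppressed.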
We also have pursued parallelization of another class of finite volume methods, in which the quasi-static evolution of force-free magnetic fields in low-pressure plasmas serves as a model for coronal magnetic structures. One approach represents the force-free field using Clebsch variables, which are iterated to convergence by relaxing the Lorentz force to zero. The mixed second derivatives in the force term give rise to a 27-point stencil in 3D. The predecessor serial code used Gauss-Seidel iteration as the relaxation method. We have adopted in its place a red-black iteration scheme, in which half of the grid point values are updated simultaneously and in parallel at each step. This algorithm has been implemented on the NRL CM-5. Preliminary results agree well with the serial model, and Fig. 2 shows roughly a factor of five speedup with the parallel code. Simulations with up to 512x512x512 grid points have been performed on the NRL CM-5. A slightly different approach that we have pursued uses a vector potential to represent the magnetic field, which has advantages over the Clebsch-variables method when the field contains null points. This model has been implemented on the IBM SP2, with one horizontal plane assigned to each processor and the interprocessor communications handled by message passing. We are progressing now from 64x64x64 grids with this model toward our target of 1024x1024x1024, as the system software improves.
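The red-black ordering is independent of the particular stencil. The single-node NumPy sketch below, with a simple 7-point Poisson relaxation standing in for the 27-point force-free operator, illustrates how splitting the grid by the parity of i+j+k lets each half-sweep update its points simultaneously; the grid size and source term are assumptions for the example.

    import numpy as np

    def red_black_relax(u, f, h, sweeps=1):
        """Red-black Gauss-Seidel relaxation of laplacian(u) = f on a
        periodic 3D grid with spacing h (illustrative stand-in for the
        force-free relaxation described in the text)."""
        i, j, k = np.indices(u.shape)
        red = (i + j + k) % 2 == 0
        for _ in range(sweeps):
            for color in (red, ~red):
                # Sum of the six nearest neighbors, all of the other color,
                # so every point of one color can be updated in parallel.
                nb = (np.roll(u, 1, 0) + np.roll(u, -1, 0) +
                      np.roll(u, 1, 1) + np.roll(u, -1, 1) +
                      np.roll(u, 1, 2) + np.roll(u, -1, 2))
                u = np.where(color, (nb - h * h * f) / 6.0, u)
        return u

    # Illustrative use (assumed sizes).
    n = 32
    u = np.zeros((n, n, n))
    f = np.random.default_rng(0).standard_normal((n, n, n))
    f -= f.mean()                      # periodic problem needs a zero-mean source
    u = red_black_relax(u, f, h=1.0, sweeps=50)

On a data-parallel machine each half-sweep becomes a single masked array update over half of the grid points.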
Our parallelization efforts in the last HPCC cycle focused on the 3D code CRUNCH3D, which solves the fully compressible, viscoresistive MHD equations using Fourier collocation in space together with time-stepping techniques. The development of this algorithm marked an important advance in MHD turbulence modeling over earlier, incompressible treatments. We converted the vector version of the code to the CM-5 with the assistance of CM specialist D. Norton at NRL. We found it optimal to precompute boundary masks and other static arrays that operate on subsets of the field and to store them in memory, as was done for the FCT models.
The greatest improvements in CRUNCH3D were associated with the fast Fourier transform (FFT) on the CM-5. To minimize data movement and the resulting communications costs, the first array dimension is kept entirely within processor, while the other two dimensions are distributed across processors. Operations along the serial axis are then very fast, limited only by the CPU speed of the nodes rather than by the data transfer rate between nodes. Operations along the parallel axes are handled by transposing the arrays to bring a distributed dimension into processor and then performing a forward FFT in-processor along that dimension. The derivatives in wavenumber space are then computed by multiplication with a static array. A reverse FFT followed by an inverse transpose completes the update along that dimension; the process is then repeated for the second parallel dimension.
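The following single-node NumPy sketch mimics that transpose/FFT/multiply/inverse-FFT pattern for a derivative along one of the distributed axes. On one node the transposes are ordinary memory operations, whereas in the parallel code they represent the interprocessor data movement; the grid sizes and domain length used in the check are assumptions for the example.

    import numpy as np

    def ddy_spectral(u, ly):
        """Spectral derivative along the second (distributed) axis of a 3D field."""
        # Angular wavenumbers for the periodic y-direction of length ly.
        k = 2.0 * np.pi * np.fft.fftfreq(u.shape[1], d=ly / u.shape[1])
        ut = np.transpose(u, (1, 0, 2))              # bring y-axis into processor
        ut_hat = np.fft.fft(ut, axis=0)              # forward FFT along y
        ut_hat *= 1j * k[:, None, None]              # derivative = multiply by i*k
        dudy = np.real(np.fft.ifft(ut_hat, axis=0))  # reverse FFT
        return np.transpose(dudy, (1, 0, 2))         # inverse transpose

    # Quick check against an analytic derivative (assumed domain of length 2*pi).
    ny = 64
    y = np.linspace(0.0, 2.0 * np.pi, ny, endpoint=False)
    u = np.sin(y)[None, :, None] * np.ones((4, ny, 4))
    err = np.max(np.abs(ddy_spectral(u, 2.0 * np.pi) - np.cos(y)[None, :, None]))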
Minimizing the number of array transpositions reduced the computation time for the parallel code by more than a factor of two, because data movement consumes 90% of the time required to execute the transpose/FFT combination. We obtained a substantial further reduction by replacing real-to-complex with complex-to-complex FFTs. An additional speedup was achieved by operating on the transformed arrays in their natural, bit-reversed order, which eliminates the reordering step otherwise required after each transform.
The resulting parallel code can currently be run at a resolution of 256x256x256 Fourier modes on NRL's 256-node CM-5. As shown in Fig. 2, we achieve an order-of-magnitude speedup over a single-processor Cray C90, attaining a total speed of about 4 GFlops and using about 40 s of computer time per step. We routinely run this code at a resolution of 128x128x128 Fourier modes, with each time step requiring about 5 s to execute. Thus, the computation time increases linearly with the number of grid points (a factor of 8 in both), demonstrating the scalability of our spectral model. As a result, CRUNCH3D also could reach 15 GFlops on a present-day, 1024-node CM-5.
Our implementation of the PPPM algorithm on the CM-5, MLGPPPM, is built around the monotonic Lagrangian grid (MLG) particle tracking technique. This technique orders the storage of particle coordinates and other data in machine memory such that particles in physical proximity to each other also occupy nearby storage locations. As the particles move, they are shuffled periodically in memory using a sorting process that scales as N log N, so that the correspondence between physical space and memory is retained. We have found that this novel sorting technique, combined with some CM-specific strategies, leads to a highly parallelized algorithm with reduced communications costs compared with older Eulerian methods. The advantages of our CM-5 approach can be expected to carry over to other parallel architectures.
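A full MLG construction enforces monotonicity of the particle indexing in every coordinate. The simpler NumPy sketch below captures the essential idea on which that rests: an N log N sort on a coarse spatial key that keeps spatially neighboring particles in neighboring memory locations. The function, key, and parameters here are illustrative and are not the MLGPPPM implementation.

    import numpy as np

    def spatial_sort(pos, data, ncell, box):
        """Reorder particle arrays so that spatially nearby particles sit in
        nearby memory locations (an N log N reshuffle in the spirit of the
        MLG reordering described above)."""
        # Coarse cell index of each particle in a cubic box of side `box`.
        cell = np.clip((pos / box * ncell).astype(np.int64), 0, ncell - 1)
        # Lexicographic cell key; sorting on it groups particles by cell.
        key = (cell[:, 0] * ncell + cell[:, 1]) * ncell + cell[:, 2]
        order = np.argsort(key, kind='stable')     # N log N sort
        return pos[order], data[order]

    # Example: reshuffle positions and any per-particle payload together.
    rng = np.random.default_rng(1)
    pos = rng.random((10000, 3))
    vel = rng.standard_normal((10000, 3))
    pos, vel = spatial_sort(pos, vel, ncell=16, box=1.0)

Repeating such a reshuffle every few timesteps keeps the memory layout tracking the flow, which is what keeps near-neighbor interactions local and the communications cost low.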