ORAC implements a weak-scaling parallel algorithm via MPI calls in
H-REM/SGE generalized ensemble simulations or in driven simulation
technologies based on the Crooks theorem, and a strong-scaling
parallel approach at the OpenMP layer based on particle decomposition
for the computation of the forces. The nature of the executable
(serial, MPI only, or MPI/OpenMP, with or without the FFTW libraries)
is controlled in a straightforward manner by the options of the
configure command, as described above.
Parallel execution can therefore be done at the MPI level, at the
OpenMP level, or at the two levels combined, using the appropriate
target executable. ORAC is designed to maximize the sampling
efficiency of a single MD simulation job on NUMA multicore platforms
using advanced highly parallel techniques, such as H-REM or SGE, or
even zero-communication embarrassingly parallel approaches such as
FS-DAM. We refer to the parallelism associated with such weakly
scaling algorithms as ``thermodynamic''. Thermodynamic parallelism in ORAC is handled only
at the distributed memory MPI level, with a minimal inter-replica
communication overhead for H-REM and SGE simulations and no overhead
at all for the independent NE trajectories in FS-DAM
simulations. The number of MPI processes in ORAC (set via the mpiexec
command) therefore simply equals the number of replicas/walkers in
H-REM/SGE simulations or the number of independent NE trajectories in
an FS-DAM job.
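As a minimal illustration (the executable name orac, the hypothetical
input file name hrem.in, and the reading of the main input from
standard input are assumptions that may differ in a given
installation), an eight-replica H-REM job would be started with eight
MPI processes, one per replica/walker:

  mpiexec -n 8 orac < hrem.in    # 8 MPI processes = 8 replicas/walkers

The same command with an FS-DAM input would instead spawn eight
independent NE trajectories.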
By default, ORAC uses a fixed (static) number of OpenMP threads, nthr, for all asynchronous force computations. This number,
returned by a standard run time library routine, corresponds to the
number of cores found on the compute node. This is the case even when
more than one MPI instance is running on the same compute node, so
that the node gets overloaded with more OpenMP threads than available
cores.
In order to tame the loss of parallel efficiency due to thread
overloading and/or granularity issues produced by the default
behavior, ORAC can handle the disparate granularity of the various
force computations in the nested MTS levels using an end-user
controlled dynamic adjustment of the number of OpenMP
threads. Intramolecular forces, implementation of constraints,
step-wise advancement of velocities/coordinates, and computation of
non-bonded forces in the direct and reciprocal lattice can each be
computed with a user-controlled number of threads by switching on
dynamic threading.
Figure 11.1: ORAC input header for dynamic threading and cache line
size control. Cache line size is expressed in units of 8-byte
words. The ``T'' following the shebang-like sequence ``#&'' is
optional; when provided, the program produces a detailed timing of
the various parallel computations.
From the end user point of view, dynamic threading
can be specified by supplying the optional heading reported in
Figure 11.1 in the main input file. In this heading, the
shebang-like character sequence ``#&'' instructs an OpenMP-compiled
executable to implement dynamic threading with up to three
OpenMP levels: the main level, for non-bonded forces and the fast Fourier
transform of the gridded charge array in the PME computation; level-1, for
fast forces (improper torsions and bendings) and step-wise advancement
in all MTS levels; level-2, for bond constraints, proper torsions and direct lattice
Ewald intramolecular corrections. In the same heading, the user can
also control the cache-line size to be used in the various previously
defined OpenMP levels. Cache line size control in OpenMP applications
is essential in minimizing the impact of cache misses due to
``false sharing''. The latter may occur in the reduction operation on
the force arrays involved in the particle decomposition parallel
algorithm, when threads on different processors modify variables that
reside on the same cache line, thus invalidating the cache line for
all processors (cache misses) and hurting the parallel
performance. Cache misses clearly depend on the length of the cache
line, which is architecture dependent. Common cache line sizes are 32,
64 and 128 bytes. To keep up with rapidly evolving multicore
architectures, ORAC allows the actual cache line size of the
underlying compute node to be specified in the compact main input
heading reported in Figure 11.1, thus guaranteeing cache coherence in
the force reduction operations. The optional OpenMP heading is
interpreted as a simple comment by executables compiled for serial
execution, thus preserving the backward compatibility of the input
file with past non-OpenMP releases.
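Since the cache line size in the heading of Figure 11.1 is expressed
in 8-byte words, the common 64-byte cache line corresponds to a value
of 8 (and a 128-byte line to 16). On a Linux compute node the actual
line size in bytes can be queried, for instance, with:

  getconf LEVEL1_DCACHE_LINESIZE
  cat /sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size

Both commands print the cache line size in bytes; dividing by 8 gives
the value to be entered in the heading.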
When launched in an MPI environment, ORAC creates, in the directory
from which it was launched, nprocs new PARXXXX directories into which
the main input file is copied and where all the output of the
replicas is written. N.B.: existing PARXXXX directories are
overwritten when the code is launched a second time. This can be
avoided by changing the ``PAR'' prefix to a user-defined string,
specified in the main input file as:
#!&string
... #input follows
The only two files that need to be
in the directory from which ORAC is launched are the main input file and
the REM.set file (the latter only if a REM simulation is started from
scratch and the scaling factors of the replicas are assigned manually
rather than automatically; see SETUP(&REM)).
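As a sketch of the resulting layout (the file names and the
zero-padded numbering of the XXXX suffix are illustrative assumptions
only), a four-replica H-REM job started from scratch with manually
assigned scaling factors would turn the launch directory into
something like:

  hrem.in    # main input file
  REM.set    # scaling factors (only for a REM run started from scratch
             # with manual assignment)
  PAR0001/  PAR0002/  PAR0003/  PAR0004/   # one directory per replica,
             # each holding a copy of the main input and the replica output

If the #!&string directive is used, the replica directories carry the
user-defined string in place of ``PAR''.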