Parallel version

ORAC implements a weak-scaling parallel algorithm, via MPI calls, for H-REM/SGE generalized-ensemble simulations and for driven-simulation technologies based on the Crooks theorem, and a strong-scaling parallel approach at the OpenMP layer, based on particle decomposition, for the computation of the forces. The nature of the executable (serial, MPI only, or MPI/OpenMP, with or without the FFTW libraries) is controlled straightforwardly by the options passed to the configure command, as described above. Parallel execution can therefore be done at the MPI level, at the OpenMP level, or at the two combined levels, using the appropriate target executable. ORAC is designed to maximize the sampling efficiency of a single MD simulation job on NUMA multicore platforms using highly parallel techniques, such as H-REM or SGE, or even zero-communication, embarrassingly parallel approaches such as FS-DAM. We deem as ``thermodynamic'' the parallelism related to such weakly scaling algorithms. Thermodynamic parallelism in ORAC is handled only at the distributed-memory MPI level, with a minimal inter-replica communication overhead for H-REM and SGE simulations and no overhead at all for the independent NE trajectories of FS-DAM simulations. Therefore, the number of MPI processes in ORAC (defined via the mpiexec command) simply equals the number of replicas/walkers in H-REM/SGE simulations or the number of independent NE trajectories in an FS-DAM job.
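
This replica-level (thermodynamic) layer can be pictured with the following minimal C/MPI sketch. It is not ORAC source code, only an illustration of the scheme described above: each MPI rank propagates one independent walker, and the only communication (absent altogether in the FS-DAM case) is an occasional small inter-replica message. The routines run_md_segment and attempt_exchange are hypothetical placeholders.

#include <mpi.h>

/* Hypothetical placeholders for the MD engine: in a real code they would
   propagate the coordinates of one replica and attempt a replica exchange. */
static void run_md_segment(int replica)                  { (void)replica; }
static void attempt_exchange(int replica, int nreplicas) { (void)replica; (void)nreplicas; }

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* index of this replica/trajectory      */
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);  /* total number of replicas/trajectories */

    for (int segment = 0; segment < 1000; segment++) {
        run_md_segment(rank);              /* independent MD: no communication at all     */
        attempt_exchange(rank, nprocs);    /* H-REM/SGE only: small inter-replica message */
    }
    MPI_Finalize();
    return 0;
}

Launching such a scheme with, e.g., mpiexec -n 8 maps one MPI rank onto each of the eight walkers, consistent with the ORAC convention that the number of MPI processes equals the number of replicas or independent NE trajectories.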

By default, ORAC uses a fixed (static) number of OpenMP threads, nthr, for all asynchronous force computations. This number, returned by a standard run-time library routine, corresponds to the number of cores found on the compute node. This holds even when more than one MPI instance is running on the same compute node, so that the node gets overloaded with more OpenMP threads than cores. In order to tame the loss of parallel efficiency due to thread overloading and/or to the granularity issues produced by this default behavior, ORAC can handle the disparate granularity of the various force computations in the nested MTS levels through an end-user-controlled dynamic adjustment of the OpenMP thread counts. When dynamic threading is switched on, the intramolecular forces, the implementation of constraints, the step-wise advancement of velocities/coordinates, and the computation of the non-bonded forces in the direct and reciprocal lattice can each be computed with a user-controlled number of threads.
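
As an illustration only (this is not ORAC source code, and the thread counts nthr_main, nthr_fast and nthr_constr are hypothetical user choices), the idea of assigning a different OpenMP thread count to each class of force computation can be sketched in C as follows:

#include <omp.h>
#include <stdio.h>

int main(void)
{
    int nthr_main   = omp_get_num_procs(); /* non-bonded forces: use all cores on the node     */
    int nthr_fast   = 4;                   /* fast intramolecular forces (bendings, impropers) */
    int nthr_constr = 2;                   /* constraints and other fine-grained tasks         */

    #pragma omp parallel num_threads(nthr_main)
    {
        /* direct- and reciprocal-lattice non-bonded forces */
    }
    #pragma omp parallel num_threads(nthr_fast)
    {
        /* fast forces and step-wise velocity/coordinate advancement */
    }
    #pragma omp parallel num_threads(nthr_constr)
    {
        /* bond constraints, proper torsions, Ewald intramolecular corrections */
    }
    printf("cores detected on this node: %d\n", omp_get_num_procs());
    return 0;
}

Assigning fewer threads to the fine-grained tasks limits the overhead of spawning and synchronizing threads on loops that are too short to scale.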

Figure: ORAC input header for dynamic threading and cache-line size control. The cache-line size is expressed in units of 8-byte words. The ``T'' following the shebang-like sequence ``#&'' is optional; when provided, the program produces a detailed timing of the various parallel computations.
\includegraphics[scale=0.6]{header.eps}
From the end-user point of view, dynamic threading can be specified by supplying, in the main input file, the optional heading reported in Figure 11.1. In this heading, the shebang-like character sequence ``#&'' instructs an OpenMP-compiled executable to implement dynamic threading with up to three OpenMP levels: the main level, for non-bonded forces and the fast Fourier transform of the gridded charge array in the PME computation; level 1, for fast forces (improper torsions and bendings) and the step-wise advancement in all MTS levels; level 2, for bond constraints, proper torsions and direct-lattice Ewald intramolecular corrections. In the same heading, the user can also control the cache-line size to be used in the OpenMP levels defined above. Cache-line size control in OpenMP applications is essential for minimizing the impact of the cache misses due to ``false sharing''. The latter may occur in the reduction operation on the force arrays involved in the particle-decomposition parallel algorithm, when threads on different processors modify variables that reside on the same cache line, thus invalidating the cache line for all processors (cache misses) and hurting the parallel performance. Cache misses clearly depend on the length of the cache line, which is architecture dependent; common cache-line sizes are 32, 64 and 128 bytes. To keep up with the rapidly evolving multicore architectures, the compact main input heading reported in Figure 11.1 allows the user to specify the actual cache-line size of the underlying compute node, thus guaranteeing cache coherence of the force reduction operations. The OpenMP optional heading is interpreted as a simple comment by executables compiled for serial execution, thus preserving the backward compatibility of the input file with past non-OpenMP releases.
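
The role of the cache-line size in the force reduction can be sketched in C/OpenMP as follows. This is not the ORAC implementation, only an illustration in which each thread accumulates into its own scratch row and the row length is padded to a whole number of cache lines (here assumed to be 8 double-precision words, i.e. 64 bytes), so that two threads never write to the same cache line.

#include <omp.h>
#include <stdlib.h>
#include <string.h>

#define NATOMS     1024
#define CACHE_LINE 8      /* assumed cache-line size in 8-byte words (cf. Figure 11.1) */

double fx[NATOMS];        /* global force array (x components only, for brevity) */

void reduce_forces(void)
{
    int nthr   = omp_get_max_threads();
    /* pad each per-thread row to a whole number of cache lines */
    int stride = ((NATOMS + CACHE_LINE - 1) / CACHE_LINE) * CACHE_LINE;
    double *scratch = calloc((size_t)nthr * stride, sizeof(double));

    #pragma omp parallel
    {
        double *mine = scratch + (size_t)omp_get_thread_num() * stride;
        #pragma omp for
        for (int i = 0; i < NATOMS; i++)
            mine[i] += 1.0;                /* stands in for the pair-force loop */
    }

    /* final reduction of the per-thread rows into the global force array */
    memset(fx, 0, sizeof fx);
    for (int t = 0; t < nthr; t++)
        for (int i = 0; i < NATOMS; i++)
            fx[i] += scratch[(size_t)t * stride + i];

    free(scratch);
}

int main(void) { reduce_forces(); return 0; }

If the per-thread rows were packed without padding, or padded with the wrong cache-line size, a line at a row boundary could be written by two different threads, and the resulting invalidation traffic would degrade the scaling of the reduction.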

When launched in an MPI environment, ORAC creates, in the directory from which it was launched, nprocs new directories named PARXXXX, into which the main input file is copied and where all the output of the replicas is written. N.B.: existing PARXXXX directories are overwritten when the code is launched a second time. This can be avoided by changing the ``PAR'' prefix to a user-defined string, specified in the main input file as:

#!&string
... #input follows
The only two files that need to be in the directory from which ORAC is launched are the main input file and the REM.set file (the latter only if a REM simulation is started from scratch and the scaling factors of the replicas are assigned manually rather than automatically; see SETUP(&REM)).
