From CAC Wiki
Gaussian 16 Revision A.03 Release Notes Usage Notes: 1. Use of the current generation of NVIDIA GPUs is supported for Hartree-Fock and DFT calculations. Refer to the performance notes for details. 2. Parallel performance on larger numbers of processors has been improved. Refer to the performance notes for details and how to get optimal performance on multiple CPUs, clusters, and GPUs. 3. The parameters which are specified in link-0 (%) input lines and/or in a Default.Route file can now also be specified via either command-line arguments or environment variables. Details are in the input notes. 4. There are new tools for interfacing Gaussian with other programs both in compiled languages such as Fortran and C and with interpreted languages such as Python and Perl. Refer to the interfacing notes for details. Changes in Defaults between Gaussian 09 and Gaussian 16: 1. The follow defaults are different in Gaussian 16: a. Integral accuracy is 10^-12 rather than 10^-10 in G09 b. The default DFT grid for general use is UltraFine rather than FineGrid in G09; the default grid for CPHF is SG1 rather than CoarseGrid. c. SCRF defaults to the symmetric form of IEFPCM (not present in G09) rather than the non-symmetric version. d. Physical constants use the 2010 values rather than the 2006 values in G09. The G09Defaults route keyword sets the defaults back to the G09 values for compatibility with previous calculations, but the new defaults are strongly recommended for new studies. 2. Gaussian 16 defaults memory usage to %mem=100mw. Even larger values are appropriate for calculations on larger molecules and when using many processors; refer to the performance notes for details. 3. TDDFT frequency calculations do analytic second derivatives by default, since these are much faster than the numerical derivatives which were the only choice in G09. ------------------------------------------------------------------------ Building Gaussian 16 from Source Code 1. By default, the build on x86 and x86_64 machines builds for the current CPU type (or the closest supported CPU), uses the current level 3 cache size and does not include support for GPUs. A specific CPU type can be specified as an argument to the build script, e.g. bsd/bldg16 all sandybridge to build executables compiled for Intel sandybridge and later processors even if the current machine is a different type of x86_64. Similarly, bsd/bldg16 all gpu will build with NVIDIA k40 and k80 GPU support and the current type of x86_64 processor or use bsd/bldg16 all gpu sandybridge to turn on both gpu support and a particular CPU type. By default, the build scripts check the size of level 3 cache on each chip and the number of CPUs per chip and set the parameter for the amount of cache each thread should try to use as L3 size divided by number of CPUs. If the executables are being built on one type of systems but will be used primarily on a different machine, then before building one should set the CACHESIZE environment variable to the amount of cache in bytes that each CPU/thread should use. On x86_64 machines the supported machine types are: ia32p4 (32-bit pentium) amd64 (legacy AMD machines) em64t (legacy Intel x86_64 machines) istanbul (less old AMD machines) nehalem (less old Intel machines) sandybridge (Intel Sandybridge machines) haswell (Intel Haswell and Broadwell machines) For IBM Power 8 under Linux the option is: ibmp8le (Little-endian Power 8 linux) (The default is Big-endian Linux) For NEC the option is: necsxace (Ace machine) 2. Building Gaussian 16 with Linda requires Linda version 9.0; the executables will build but will not function with previous version of Linda. The linda 9.0 file should be un-tarred into the g16 directory, then the bsd/fixlinda script should be run to set paths and soft links appropriately and then the linda executables can be built with "mg linda". 3. Building on Intel Macs requires a case-sensitive file system. ------------------------------------------------------------------------ Input Notes Most options that control how Gaussian 16 operates can be specified in any of 4 ways. From highest to lowest precedence these are: 1. As link0 input (%-lines). This is the usual method to control a specific job and the only way to control a specific step within a multi-step input file. 2. As options on the command line. This is useful when one wants aliases or other shortcuts for different common ways of running the program. 3. As environment variables. This is most useful in standard scripts, for example for generating and submitting jobs to batch queuing systems. 4. As lines in a Default.Route file. This is most useful when one wants to change the program defaults for all jobs. When searching for a Default.Route file the current default directory is checked first and then the directories in the path for G16 executables (environment variable GAUSS_EXEDIR, which normally points to $g16root/g16). The parameters which control defaults for Gaussian jobs are: Input line Command Line Environment Variable Default.Route Meaning ---------- ------------ -------------------- ------------- ------- %cpu=... -c="..." GAUSS_CDEF "..." -C- ... which CPUs to use %gpucpu=... -g="..." GAUSS_GDEF "..." -G- ... which GPUs to use and bind to which CPUs. %usersh -s="rsh" GAUSS_SDEF "rsh" -S- rsh Linda should use rsh to start workers %usessh -s="ssh" GAUSS_SDEF "ssh" -S- ssh Linda should use ssh to start workers %lindaworkers=... -w="..." GAUSS_WDEF "..." -W- ... which nodes to use with Linda -r="..." GAUSS_RDEF "..." -R- ... defaults for route -h="..." GAUSS_HDEF "..." -H- ... hostname for archive entry -o="..." GAUSS_ODEF "..." -O- ... organization/site for archive entry The parameters which control defaults for Gaussian utility programs are: GAUSS_FDEF "..." -F- ... default type for formchk (-c, -3, etc.) GAUSS_UDEF "..." -U- ... default memory size for utilities. There are also parameters which are primarily useful when running Gaussian from scripts or external programs: # -x="..." GAUSS_XDEF "..." complete route for job (no route section will be read from the input file). %chk= -y="..." GAUSS_YDEF "..." checkpoint file for job. %rwf= -z="..." GAUSS_ZDEF "..." read-write file for job. Note that the quotation marks are normally required for the command line and environment variables, to avoid modification of the parameter string by the shell. The deprecated parameters %nprocshared and %nproclinda can also be defaulted (flagged by P and L, respectively). ------------------------------------------------------------------------ Parallel Usage and Performance Notes I. Shared-memory parallelism 1. Calculations involving larger molecules and basis sets benefit from larger memory allocations. 4GBytes or more per processor is recommended for calculations involving 50 or more atoms and/or 500 or more basis functions. The freqmem utility estimates the optimal memory size per thread for ground-state frequency calculations and the same value is reasonable for excited-state frequencies and is more than sufficient for ground and excited state optimizations. The amount of memory allowed should rise with the number of processors: if 4GByte is reasonable for one processor, then the same job using 8 CPUs would run well in 32 GBytes. Of course, there may be limitations to smaller values imposed by the particular hardware, but scaling memory linearly with number of CPUs should be the goal. In particular, increasing only the number of CPUs with fixed memory size is unlikely to lead to good performance when using large numbers of processors. For large frequency calculations and for large CCSD and EOM-CCSD energies, it is also desirable to leave enough memory to buffer the large disk files involved. So a Gaussian job should only be given 50-70% of the total memory on the system. For example, on a machine with a total of 128 GBytes, one should typically give 64-80 GBytes to a job which was using all the CPUs and leave the remaining memory for the operating system to use as disk cache. 2. Efficiency is lost when threads are moved from one CPU to another, thereby invalidating the cache and causing other overhead. On most machines Gaussian can tie threads to specific CPUs and this is the recommended mode of operation, especially when using larger numbers of processors. The %cpu link0 line specifies the numbers of specific CPUs to be used. Thus on a machine with one 8-core chip one should use %cpu=0-7 rather than %nproc=8 because the former ties the first thread to CPU 0, the next to CPU 1, etc. On some older Intel processors (Nehalem and before) there is not enough memory bandwidth to keep all the CPUs on a chip busy and it is often preferable to use half the CPUs, each with twice as much memory as if all were used. For example, on such a machine with 4 12-core chips and 128 GBytes of memory, with CPUs 0-11 on the first chip, 12-23 on the second, etc., it is better to run using 24 processors (6 on each chip) and give them 72/24=3GByte memory each, rather than use all 48 with only 1.5GBytes of memory each. The input would be %mem=72GB %cpu=0-47/2 where the /2 means to use every other core, i.e. cores 0, 2, 4, 6, 8, and 10 (on chip 0), 12, 14, 16, 18, 20, and 22 (on chip 1), etc. With the most recent generations of Intel processors (Haswell and later) the memory bandwidth is better and using all the cores on each chip works well. 3. As long as sufficient memory is available and threads are tied to specific cores, then parallel efficiency on large molecules is good up to 64 or more cores. 4. Hyperthreading is not useful for Gaussian since it effectively divides resources such as memory bandwidth among threads on the same physical CPU. If hyperthreading cannot be turned off, Gaussian jobs should use only one hyperthread on each physical CPU. Under Linux, hyperthreads on different processors are grouped together. That is, if a machine has 2 chips each with 8 cores and 3-way hyperthreading, then "CPUs" 0-7 are across the 8 cores on chip 0, 8-15 are across the 8 cores on chip 1, then 16-23 are the second hyperthreads on the 8 cores of chip 0, etc. So a job would run best with %cpu=0-15. Under AIX, hyperthreads are grouped together with up 8 hyperthread numbers for each CPU even if fewer hyperthreads are in use, so with 2 8 core chips and 4-way hyperthreading, "CPUs" 0-3 are all on core 0 of chip 0, 8-11 are on core 1 of chip 0, etc. So one would want to use %cpu=0-127/8 to select "CPUs" 0, 8, 16, etc. which are each using a distinct core. II. Cluster (Linda) parallelism 1. Hartree-Fock and DFT energies, gradients and frequencies run in parallel across clusters as do MP2 energies and gradients. MP2 frequencies, CCSD, and EOM-CCSD energies and optimizations are SMP parallel but not cluster parallel. Numerical derivatives such as DFT Anharmonic frequencies and CCSD Frequencies, are parallelized across nodes of a cluster by doing complete gradient or second derivative calculation on each node, splitting the directions of differentiation across workers in the cluster. 2. Shared-memory and cluster parallelism can be combined. Generally, one uses shared-memory parallelism across all CPUs in each node of the cluster. Note that %cpu and %mem apply to each node of the cluster. Thus if one has 3 nodes names a, b, and c, each with 2 chips which have 8 CPUs each, then one might specify %mem=64gb %cpu=0-15 %lindaworkers=a,b,c #p b3lyp/6-31g* freq This would run 16 threads, each pinned to a CPU, on each of the 3 nodes, giving 4Gbytes to each of the 48 threads. 3. For the special case of numerical differentiation (Freq=Anharm, CCSD Freq, etc.) only, one extra worker is used to collect the results. So these jobs should be run with two workers on the master node (where g16 is started). For the above example if the job was doing anharmonic frequencies, then one would do %mem=64gb %cpu=0-15 %lindaworkers=a:2,b,c #p b3lyp/6-31g* freq=anharm where g16 is assumed to be started on node a. This will start 2 workers on node a, one of which just collects results, and will do the computational work using the other worker on a and those on b and c. III. Using GPUs. 1. Gaussian 16 can use NVIDIA K40 and K80 GPUs under Linux. Earlier GPUs do not have the computational capabilities or memory size to run the algorithms in G16. Allowing larger amounts of memory is even more important when using GPUs than for CPUs, since larger batches of work must be done at the same time in order to use the GPUs efficiently. The K40 and K80 can have up to 16 GBytes of memory and one typically tries to have most of this available for Gaussian, which requires at least an equal amount of memory for the CPU thread which is running each GPU. 8 or 9 GBytes works well if there is 12 GByte total on each GPU, or 11-12 GBytes for a 16GByte GPU. 2. When using GPUs it is essential to have the GPU controlled by a specific CPU and much preferable if the CPU is physically close to the GPU it is controlling. The hardware arrangement can be checked using the nvidia-smi utility. For example, this output is for a machine with 2 16-core Haswell CPU chips and 4 K80 boards, each of which has two GPUs: GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 CPU Affinity GPU0 X PIX SOC SOC SOC SOC SOC SOC 0-15 GPU1 PIX X SOC SOC SOC SOC SOC SOC 0-15 GPU2 SOC SOC X PIX PHB PHB PHB PHB 16-31 GPU3 SOC SOC PIX X PHB PHB PHB PHB 16-31 GPU4 SOC SOC PHB PHB X PIX PXB PXB 16-31 GPU5 SOC SOC PHB PHB PIX X PXB PXB 16-31 GPU6 SOC SOC PHB PHB PXB PXB X PIX 16-31 GPU7 SOC SOC PHB PHB PXB PXB PIX X 16-31 The important part is the CPU affinity. This shows that GPUs 0 and 1 (on the first K80 card) are connected to the CPUs on chip 0 while GPUs 2-7 (on the other three K80 cards) are connected to the CPUs on chip 1. So a job which uses all the CPUs (24 CPUs doing parts of the computation and 8 controlling GPUs) would use input %cpu=0-31 %gpucpu=0-7=0-1,16-21 or equivalently but more verbosely %cpu=0-31 %gpucpu=0,1,2,3,4,5,6,7=0,1,16,17,18,19,20,21 This pins threads 0-31 to CPUs 0-31 and then uses GPU0 controlled by CPU 0, GPU1 controlled by CPU 1, GPU2 controlled by CPU 16, etc. Normally one uses consecutive numbering in the obvious way, but things can be associated differently in special cases. For example, suppose on the other machine one already had one job using 6 CPUs running with %cpu=16-21. Then if one wanted to use the other 26 CPUs with 8 controlling GPUs one would specify: %cpu=0-15,22-31 %gpucpu=0-7=0-1,22-27 This would create 26 threads with GPUs controlled by the threads on CPUs 0,1,22,23,24,25,26, and 27. 3. GPUs are not helpful for small jobs but are effective for larger molecules when doing DFT energies, gradients and frequencies (for both ground and excited states). They are not used effectively by post-SCF calculations such as MP2 or CCSD. Each GPU is several times faster than a CPU but since on modern machines there are typically many more CPUs than GPUs, it is important to use all the CPUs as well as the GPUs and the speedup from GPUs is reduced because many CPUs are also used effectively (i.e., in a job which uses all the CPUs and all the GPUs). For example, if the GPU is 5x faster than a CPU, then the speedup from going to 1 CPU to 1 CPU + 1 GPU would be 5x, but the speedup going from 32 CPUs to 32 CPUs + 8 GPUs would be 32 CPUs -> 24 CPUs + 8 GPUs, which would be equivalent to 24 + 5x8 = 64 CPUs, for a speedup of 64/32 or 2x. 4. GPUs on nodes in a cluster can be used. Since the %cpu and %gpucpu specifications are applied to each node in the cluster, the nodes must have identical configurations (number of GPUs and their affinity to CPUs); since most clusters are collections of identical nodes, this is not usually a problem. IV. CCSD, CCSD(T) and EOM-CCSD calculations These calculations can use memory to avoid I/O and will run much more efficiently if they are allowed enough memory to store the amplitudes and product vectors in memory. If there are O active occupied orbitals (NOA in the output) and V virtual orbitals (NVB in the output) then approximately 9O^2V^2 words of memory are required. This does not depend on the number of processors used.