This is a short introduction to the OpenMP industry standard for shared-memory parallel programming. It outlines basic features of the system and explains its usage on our machines. This is not an introduction to OpenMP programming. References and links for further details are given.
What is OpenMP?
OpenMP is a system of compiler directives that are used to express parallelism on a shared-memory machine. OpenMP has become an industry standard for such directives, and by now most parallel-enabled compilers used on SMP machines can process OpenMP directives. The OpenMP standard has had a rather short and steep career: it was introduced in 1997 and has since sidelined all other similar systems.
OpenMP is designed exclusively for shared-memory machines and is based on multi-threading, i.e. the dynamic spawning of threads (lightweight sub-processes), commonly within loops. In favourable cases it is quite possible to create a well-scaling parallel program from a serial code by inserting a few lines of OpenMP directives into the serial precursor and recompiling. The simplicity and ease of use of OpenMP directives have made it a popular alternative to the more involved (and arguably more powerful) communication system MPI, which was designed for distributed-memory systems.
What kind of system uses OpenMP?
OpenMP was designed from the outset for shared-memory machines, commonly called SMP (Symmetric Multi-Processor) machines. These types of parallel computers have the advantage of not requiring explicit communication between processors for parallel processing, and therefore avoid the associated overhead. In addition, they allow multi-threading, which is a dynamic form of parallelism in which sub-processes are created and destroyed during program execution. In some cases this can be done automatically at compile time. In other cases, the compiler needs to be instructed about details of the parallel region of code where multi-threading is to take place. OpenMP was designed to perform this task.
OpenMP therefore needs both a shared-memory (SMP) computer and a compiler that understands OpenMP directives. The nodes on our clusters meet these requirements, but one has to ensure that all threads are scheduled on a single node.
OpenMP will not work across distributed-memory hardware, i.e. across the separate nodes of a cluster. It can, however, be used in combination with distributed-memory parallel systems such as MPI, provided that each of the nodes in the cluster has multiple CPUs or cores.
OpenMP is usually used in the stepwise parallelization of pre-existing serial programs. Shared-memory parallelism is often called "loop parallelism" because of the typical situations that make OpenMP compiler directives an option.
The OpenMP compiler directives are inserted into the serial code by the user. They instruct the compiler to distribute the tasks performed in a certain region of the code (usually a loop) over several sub-processes, which in turn may be executing on different CPUs.
For instance, the following Fortran loop looks as if the repeated calls to the function point() could be done in separate processes, or better on separate CPUs:
    do imesh=inz,nnn,nstep
       svec(1)=xmesh(imesh)
       svec(2)=ymesh(imesh)
       svec(3)=zmesh(imesh)
       integral=integral+wints(imesh)*point(svec)
    end do
If we are using a compiler that is able to automatically parallelize code, and try to use that feature, we will find that things are not that simple. The function call to point may hide a "loop dependency", i.e. a situation where data computed in one loop iteration depend on data calculated in another. The compiler will therefore commonly refuse to parallelize such a loop, considering it "unsafe".
The use of OpenMP directives can solve this problem:
    !$omp parallel do private (imesh,svec) &
    !$omp shared (inz,nnn,nstep,xmesh,ymesh,zmesh,wints) &
    !$omp reduction(+:integral)
    do imesh=inz,nnn,nstep
       svec(1)=xmesh(imesh)
       svec(2)=ymesh(imesh)
       svec(3)=zmesh(imesh)
       integral=integral+wints(imesh)*point(svec)
    end do
    !$omp end parallel do
The three lines of directives have the effect of forcing the compiler to distribute the tasks performed in each of the loop iterations over separate, dynamically created processes. Furthermore, they inform the compiler which variables can be used by all sub-processes (i.e. shared), and which have different values for each process (i.e. private). Finally, they direct the compiler to collect values of integral separately in each process and then "reduce" them to a common value by summing them up.
OpenMP programs need to be compiled with special compiler options and will then yield parallel code. It must be pointed out that since the compiler is forced to multi-thread specific regions of the code, it is the responsibility of the programmer to ensure that such multi-threading is safe, i.e. that no dependency exists between iterations of the parallelized loop. In the above example that means making sure that the tasks performed inside the point call are indeed independent.
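The difference between a loop with a dependency and one without can be sketched as follows. This is a minimal illustration, not code from our systems; the names dependency_demo, running_sum and scale_all are made up for this example. The first loop reads a value written by the previous iteration and must stay serial, while the second touches only its own array element and can safely carry a directive.

```fortran
module dependency_demo
  implicit none
contains

  ! Running sum: iteration i reads a(i-1), which iteration i-1 has just
  ! written. This is a loop dependency, so the loop is left serial.
  subroutine running_sum(a)
    double precision, intent(inout) :: a(:)
    integer :: i
    do i = 2, size(a)
      a(i) = a(i) + a(i-1)
    end do
  end subroutine running_sum

  ! Scaling: iteration i touches only a(i), so the iterations are
  ! independent and may be distributed over threads.
  subroutine scale_all(a, factor)
    double precision, intent(inout) :: a(:)
    double precision, intent(in)    :: factor
    integer :: i
    !$omp parallel do private(i) shared(a, factor)
    do i = 1, size(a)
      a(i) = factor * a(i)
    end do
    !$omp end parallel do
  end subroutine scale_all

end module dependency_demo

program dependency_example
  use dependency_demo
  implicit none
  double precision :: a(5)
  a = 1.0d0
  call scale_all(a, 2.0d0)   ! a is now 2, 2, 2, 2, 2
  call running_sum(a)        ! a is now 2, 4, 6, 8, 10
  write(*,*) a
end program dependency_example
```

Putting a parallel do directive in front of the loop in running_sum would compile without complaint but could give wrong results, which is exactly the situation the programmer has to rule out.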
The working principle of OpenMP is perhaps best illustrated by a programming example. The following program written in Fortran 90 computes the sum of all square-roots of integers from 0 up to a specific limit m:
    program example02
    call demo02
    stop
    end

    subroutine demo02
    integer:: m, i
    real*8 :: mys
    write(*,*)'how many terms?'
    read(*,*) m
    mys=0.d0
    !$omp parallel do private (i) shared (m) reduction (+:mys)
    do i=0,m
       mys=mys+dsqrt(dfloat(i))
    end do
    write(*,*) 'mys=',mys, ' m:',m
    return
    end
It is instructive to compare this example with the one in our MPI Help File which performs exactly the same task. It is obvious that the OpenMP version is a good deal shorter. In fact, apart from the OpenMP directives (starting with !$omp), this is just a simple serial program.
In Fortran 90, anything after a ! sign is interpreted as a comment, so that the above example, when compiled without special options, will just yield the serial version of the program. If the -qopenmp option is specified at compile time, the compiler will use the OpenMP directives to create a multi-threaded executable.
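The same comment mechanism can be used for ordinary statements. The OpenMP standard defines a conditional-compilation sentinel for free-form Fortran: a line that starts with !$ followed by a blank is compiled as a normal statement when OpenMP is enabled and stays a comment otherwise. The following sketch uses this to report how the program was built; the names openmp_probe and openmp_enabled are made up for this illustration.

```fortran
module openmp_probe
  implicit none
contains

  ! Returns .true. only if this file was compiled with OpenMP enabled.
  logical function openmp_enabled()
    openmp_enabled = .false.
    ! The next line is a plain comment for a serial build and a normal
    ! assignment for an OpenMP build.
    !$ openmp_enabled = .true.
  end function openmp_enabled

end module openmp_probe

program openmp_probe_example
  use openmp_probe
  implicit none
  if (openmp_enabled()) then
    write(*,*) 'compiled with OpenMP'
  else
    write(*,*) 'compiled as a serial program'
  end if
end program openmp_probe_example
```

Building this once with and once without the OpenMP compiler option is a quick way to check that the option is actually taking effect.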
The instruction parallel do will cause the next do-loop to be treated as a parallel region. In other words, before that loop is executed, multiple sub-processes will be created and the loop iterations will be distributed to those processes. The order in which this happens should not matter, since we have to be sure that the iterations are independent.
The instruction private(i) ensures that a separate value of the loop index is used for each thread. The index of a parallelized loop is private, i.e. thread-specific, by default, so we are specifying this only for demonstration purposes. It is in any case a good idea not to rely on default settings.
The instruction shared(m) makes sure that all threads are using the same maximum value. Under the OpenMP default rules, variables that are not listed in any clause are shared anyway, but stating this explicitly documents the intent. Since m is only read and never modified inside the loop, it is safe to use a common value.
Finally, we instruct the compiler to treat the value of mys specially. The reduction(+:mys) instruction gives each thread a private copy of mys, initialized to zero (the neutral element of the + operation). After all loop iterations have been completed, the different private values are reduced to a single one by a sum (the + sign in the directive) and combined with the value that mys had before the loop.
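What the reduction clause does can be made visible by writing it out by hand. The sketch below is an illustration only, not the way a compiler necessarily implements the clause; the names reduction_demo, sqrt_sum_manual and partial are made up. Each thread accumulates into its own private partial sum, and the partial sums are then added to the shared total one thread at a time.

```fortran
module reduction_demo
  implicit none
contains

  ! Sum of sqrt(i) for i = 0..m, with the reduction spelled out by hand.
  function sqrt_sum_manual(m) result(total)
    integer, intent(in) :: m
    double precision :: total, partial
    integer :: i

    total = 0.0d0
    !$omp parallel private(i, partial) shared(m, total)
    partial = 0.0d0                 ! every thread starts its own sum at zero
    !$omp do
    do i = 0, m
      partial = partial + sqrt(dble(i))
    end do
    !$omp end do
    ! Only one thread at a time may update the shared total.
    !$omp critical
    total = total + partial
    !$omp end critical
    !$omp end parallel
  end function sqrt_sum_manual

end module reduction_demo

program reduction_example
  use reduction_demo
  implicit none
  write(*,*) 'mys=', sqrt_sum_manual(100)
end program reduction_example
```

In practice the reduction clause is preferable: it is shorter, and it leaves the runtime free to combine the partial results in whatever way is most efficient.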
After compilation, we can convince ourselves easily that we have in fact created a parallel program. Here is the execution with a maximum of m=1,234,567,890 and only one thread:
    $ OMP_NUM_THREADS=1 time -p ./a.out < test.in
     how many terms?
     mys= 28918862541603.9 m: 1234567890
    real 3.99
    user 3.96
    sys 0.01
And here's the same run with four threads:
    $ OMP_NUM_THREADS=4 time -p ./a.out < test.in
     how many terms?
     mys= 28918862541603.6 m: 1234567890
    real 1.04
    user 4.03
    sys 0.02
We note that the result is the same to 14 significant digits, but the time of the second run is only slightly more than 1/4 of the first. The small difference in the last digit arises because the threads add up the terms in a different order, which changes the floating-point rounding. Increasing the number of threads by a factor of 4 has decreased the runtime by nearly the same factor. The deviation from linear scaling is due to overhead such as IO, internal communication and serial portions of the underlying program.
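The scaling of the two runs above can be put into numbers. The following is a back-of-the-envelope estimate only: T_1 and T_4 are the measured "real" times with one and four threads, S is the speedup, E the parallel efficiency, and the serial fraction f is obtained under the assumption that Amdahl's law describes the program, which is an assumption made here for illustration.

```latex
\begin{align*}
S &= \frac{T_1}{T_4} = \frac{3.99}{1.04} \approx 3.84, \\
E &= \frac{S}{4} \approx 0.96, \\
S &= \frac{1}{f + (1-f)/4}
  \quad\Longrightarrow\quad
  f = \frac{4/S - 1}{3} \approx 0.014 .
\end{align*}
```

In words: four threads give a speedup of about 3.84, i.e. an efficiency of about 96%, which under the stated assumption corresponds to roughly 1.4% of the single-thread runtime not being parallelized.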
Implementation on our Systems
The compilers on our machines are capable of processing OpenMP directives. No special settings need to be specified in setup files to use this capability. To compile programs with OpenMP, the addition of a compiler option is all that is required.
Compiling OpenMP code
To enable the interpretation of OpenMP compiler directives, a compiler option has to be specified when compiling. For our compiler this is:
-qopenmp (for Intel compilers)
This holds for all (Fortran, C, and C++) compilers. It is useful to also use additional options that create information on what has been parallelized. For instance, in the case of a Fortran program:
ifort -qopenmp -qopt-report -c test.f90
Note that the -qopenmp option is a macro which includes several sub-options. Also, if no optimization is specified (as in the above lines), the optimization level will automatically be increased to support multi-threading. This cannot be disabled. The -qopt-report option does not alter the behavior of the compiler, but generates a detailed report about optimization (in a file "test.optrpt") that includes information about OpenMP parallelization.
Unlike MPI programs, shared-memory parallel OpenMP programs do not need a special runtime environment to run in parallel. They only need to be told the number of threads that should be used. This is usually done by setting an environment variable. The standard variable, recognized on any OpenMP-enabled system, is OMP_NUM_THREADS. For instance:
    OMP_NUM_THREADS=16 ./test_omp.exe
runs the program test_omp.exe with 16 threads in parallel. Incidentally, the number of threads may also be set from inside the program by means of a function call. The line
    call omp_set_num_threads(16)
inside the program has the same effect as setting the environment variable. This will take precedence over external settings.
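Setting and checking the number of threads from inside a program can be sketched as follows. The routines omp_set_num_threads and omp_get_num_threads are part of the standard OpenMP runtime library (module omp_lib); the names thread_demo and team_size are made up for this illustration. All OpenMP-specific lines use the !$ sentinel, so the code also builds as a serial program, in which case it reports one thread.

```fortran
module thread_demo
  !$ use omp_lib
  implicit none
contains

  ! Requests a number of threads and returns how many the parallel
  ! region actually received (1 if compiled without OpenMP).
  integer function team_size(requested)
    integer, intent(in) :: requested
    integer :: n

    n = 1                                   ! serial fallback
    !$ call omp_set_num_threads(requested)
    !$omp parallel shared(n)
    !$omp single
    !$ n = omp_get_num_threads()
    !$omp end single
    !$omp end parallel
    team_size = n
  end function team_size

end module thread_demo

program thread_example
  use thread_demo
  implicit none
  write(*,*) 'threads in parallel region:', team_size(4)
end program thread_example
```

The reported number can be smaller than the requested one, for instance when the runtime limits the number of threads, so it is worth checking rather than assuming.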
As already pointed out, this is not an introduction to OpenMP programming. In fact, we barely scratch the surface of what can be done. OpenMP includes many directives and functions that warrant study before they can be used properly. It is necessary to point out that shared-memory programming has its pitfalls:
Often, a detailed analysis of a parallel region or a loop is necessary to determine if and how it may be parallelized using OpenMP.
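One pitfall of this kind, chosen here as an illustration, is a temporary variable that is accidentally shared between threads, so that one thread overwrites a value another thread is still using. A common safeguard is the default(none) clause, which makes the compiler reject any variable in the parallel region that has not been declared private or shared explicitly. In the sketch below, the names private_demo, shift_and_scale and tmp are made up; removing tmp from the private list would produce a compile-time error under OpenMP instead of a silent race.

```fortran
module private_demo
  implicit none
contains

  ! Computes y(i) = 2*x(i) + 1 using a per-iteration temporary.
  ! x and y are assumed to have the same size.
  subroutine shift_and_scale(x, y)
    double precision, intent(in)  :: x(:)
    double precision, intent(out) :: y(:)
    double precision :: tmp
    integer :: i, n

    n = size(x)
    ! default(none): every variable used in the loop must be listed.
    !$omp parallel do default(none) private(i, tmp) shared(x, y, n)
    do i = 1, n
      tmp  = 2.0d0 * x(i)     ! tmp must be private to each thread
      y(i) = tmp + 1.0d0
    end do
    !$omp end parallel do
  end subroutine shift_and_scale

end module private_demo

program private_example
  use private_demo
  implicit none
  double precision :: x(3), y(3)
  x = (/ 1.0d0, 2.0d0, 3.0d0 /)
  call shift_and_scale(x, y)   ! y is now 3, 5, 7
  write(*,*) y
end program private_example
```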
A good online tutorial for OpenMP shared-memory programming can be found at Lawrence Livermore National Laboratory.
There is a website devoted specifically to all things OpenMP, which is a good starting point for learning about it.
The Centre for Advanced Computing offers workshops on a regular basis, some of them devoted to OpenMP programming. They are announced on our website. We might see you there sometime soon.
A good way to check the performance of a multi-threaded program is timing it by insertion of suitable routines. This can be done by calling the subroutines ETIME and DTIME, which can give you information about actual CPU time used. However, it is advisable to carefully read the documentation before using them with OpenMP programs.
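For multi-threaded programs it helps to measure wall-clock time alongside CPU time, since CPU time is typically added up over all threads and therefore does not shrink when threads are added. The sketch below is an alternative to ETIME and DTIME, not a replacement prescribed by this document: it uses the standard OpenMP routine omp_get_wtime for wall-clock time, with the Fortran intrinsic system_clock as a fallback for serial builds, and the intrinsic cpu_time for CPU time. The names timing_demo and wall_time are made up for this illustration.

```fortran
module timing_demo
  !$ use omp_lib
  implicit none
  integer, parameter :: i8 = selected_int_kind(18)
contains

  ! Wall-clock time in seconds, measured from an arbitrary starting point.
  double precision function wall_time()
    integer(kind=i8) :: ticks, rate
    call system_clock(ticks, rate)          ! used in a serial build
    wall_time = dble(ticks) / dble(rate)
    !$ wall_time = omp_get_wtime()          ! used in an OpenMP build
  end function wall_time

end module timing_demo

program timing_example
  use timing_demo
  implicit none
  integer, parameter :: m = 10000000
  integer :: i
  real :: cpu0, cpu1
  double precision :: wall0, wall1, mys

  mys = 0.0d0
  call cpu_time(cpu0)
  wall0 = wall_time()
  !$omp parallel do private(i) reduction(+:mys)
  do i = 0, m
    mys = mys + sqrt(dble(i))
  end do
  !$omp end parallel do
  wall1 = wall_time()
  call cpu_time(cpu1)

  write(*,*) 'mys=', mys
  write(*,*) 'wall-clock seconds:', wall1 - wall0
  write(*,*) 'CPU seconds:       ', cpu1 - cpu0
end program timing_example
```

With several threads, the wall-clock figure should drop while the CPU figure stays roughly constant, which mirrors the "real" and "user" lines reported by the time command.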
We also provide a package called the HPCVL Working Template (HWT), which was created by Gang Liu. The HWT provides 3 main functionalities:
The HWT is based on libraries and script files. It is easy to use and portable (written largely in Fortran). Fortran, C, C++, and any mixture thereof are supported, as well as OpenMP and MPI for parallelism. Documentation of the HWT is available. The package is installed on the Sunfire cluster in /usr/local/hwt.
Send email to firstname.lastname@example.org. We have scientific programmers on staff who will probably be able to help you out. Of course, we can't do the coding for you but we do our best to get your code ready for multi-processor machines.