| III: MESSAGE PASSING INTERFACE: Dr. Richard S. Miller III(A): Introduction The Message Passing Interface (MPI) standard is the result of a series of workshops and conferences by the parallel computation community that culminated in 1994. The standard was developed to meet the needs of the community for an open source and portable standard of message passing routines for interprocessor communication on parallel computers. A second forum was held in 1998 to correct minor problems in the original standard and to add additional features. The result of this work by the community is the MPI-2 standard. One of the more popular MPI `packages' is the MPICH (for MPI chameleon) implementation which is freely available at: http://www/mcs.anl.gov/mpi/mpich In what follows, the author will attempt to provide a concise, yet far from thorough, introduction to the use of MPI in parallel programming. It is highly recommended that the user interested in becoming proficient with MPI read the following introductory text: Gropp, W., Lusk, E. and Skjellum, A., `Using MPI: Portable Programming with the Message-Passing Interface,' Second Edition, Massachusetts Institute of Technology Press, 1999. For advanced MPI programming and MPI-2 features, the user is referred to: Gropp, W., Lusk, E. and Thakur, R., `Using MPI-2: Advanced Features of the Message-Passing Interface,' Massachusetts Institute of Technology Press, 1999. MPI Programming: MPI is essentially a set of subroutines which are called by the user inside a program at specific locations to perform specific inter processor communication tasks. Both Fortran and C versions of the MPI routines are in place, and both are freely available at the above site within the MPICH implementation. The MPI-2 standard also allows for new processes (threads) to be added dynamically during program implementation. In total, far more than one hundred individual routines are included in the MPI and MPI-2 standards. Fortunately for the novice, only a limited subset of these routines is generally used in typical parallel programming applications. The author's research codes discussed in Section I use only fourteen of these routines. In fact, only six routines are required to create a parallel program. Two of these routines simply `turn on' and `turn off' MPI. The basic six MPI routines are:
MPI_INIT(...): Initialize MPI
MPI_COMM_SIZE(...): Determine how many processors there are
MPI_COMM_RANK(...): Determine which processor `I' am
MPI_SEND(...): Send the specified data
MPI_RECV(...): Receive incoming data
MPI_FINALIZE(...): Terminate MPI
The eight additional routines commonly used by the author are:
MPI_CART_CREATE(...): Defines a new communicator for Cartesian topology
MPI_CART_SHIFT(...): Returns shifted neighbor processor identifications
MPI_CART_GET(...): Returns Cartesian processor topology
MPI_SENDRECV(...): Combined send and receive
MPI_GATHER(...): Gathers values from groups of processors
MPI_BCAST(...): Sends values to groups of processors
MPI_BARRIER(...): Stop here until all specified processors catch up
Each of the above routines is inserted into a code with their required arguments to accomplish the desired task. The communication routines operate on integers, single precision, or double precision real scalars or arrays of any dimension. Specification of the routine arguments is discussed below. III(B): Conceptual MPI In the author's experience, it is not the command syntax which gives the novice user the most difficulty in learning MPI, it is the conceptual understanding of how MPI spawns threads to different processors. This will therefore be the first topic covered. The most important concept to understand involving the implementation of parallel programs is the concept of how the individual threads are being run on each processor involved in the total program. Consider some generic MPI based parallel program to be run on two processors. In its simplest form, when the `mpirun' command is used to call the executable file, two individual and identical copies of the executable are `sent' to the two processors. It is easiest to consider that these are two individual programs running on two machines. Unless specifically instructed by the user to communicate with one another, neither program has any knowledge whatsoever of the existence of the other program. Both programs proceed through the implementation of the executable just as they would on a serial machine, proceeding line by line, calling subroutines, and otherwise acting in the normal program manner. The MPI routines allow the user to give each program specific instructions as to how to behave in a manner consistent with the presence and complementary nature of the other program. For example, the `MPI_COMM_RANK' routine is called to return a processor designation, typically denoted `myid' for `my identification number', which ranges from 0 to NCPUS-1, where NCPUS is the total number of processors. So, while proceeding through the hypothetical two processor problem, both processors may come to some `if statement' specifying that if myid=0 then this processor performs some set of tasks. The second processor, not having myid=0, will skip this set of tasks and simply proceed with whatever lines of code follow the if statement. This can cause problems if care is not taken as the two programs can become `out of sync' with one another. At some point the individual programs will generally also need some information from the other program; such as a boundary value to calculate a finite difference expression using a stencil which spans two subdomains. At this point, the user must specify that the `normal' progression of each code be paused, and communication routines be invoked. Again, care must be taken by the user to ensure that the individual threads are `in sync' at the point at which they require shared information; otherwise erroneous calculations may occur. Again, in the author's experience it is this concept of individual and identical copies of programs being run by each processor which is one of the biggest impediments to the novice parallel programmer. It is suggested that the reader keep this in mind when `deciphering' the example programs which follow. III(C): Example Fortran Codes Three example MPI programs are provided below. The first program is perhaps the simplest MPI program imaginable. It simply intitializes a group of processors and has each one print to the screen its identification number. The second is a relatively simple program for calculating the value of PI through trapezoidal integration (this program is included with the MPICH implementation from Argonne Labs). In this and the first program, the number of processors (#) is determined solely by the user when invoking the `mpirun -np # pi.f' run command. The third program solves a one-dimensional wave equation using a simple explicit `forward time, backward space' finite difference technique. In this case, the finite difference domain is divided into equal adjacent subdomains on which the individual processors solve the wave equation. MPI is used to communicate boundary values to adjacent processors. The number of processors called with the mpirun command must match the number of processors declared within parameter statements. Discussion of the particular arguments in the MPI subroutine calls is delayed until the following section in order to simplify the presentation; however, the reader may find it valuable to reference Section (D) while studying the following programs: PROGRAM 1:
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
program main
C The include 'mpif.h' must be included in all MPI programs
include 'mpif.h'
integer myid, numprocs, ierr
C Turn on MPI.................................................
call MPI_INIT( ierr )
C Retrieve the identification number `myid' of each processor
C (MPI_COMM_WORLD is a designation for all processors; subsets
C of processors may also be defined in MPI)
call MPI_COMM_RANK( MPI_COMM_WORLD, myid, ierr )
C Determine the total number of processors `numprocs'
call MPI_COMM_SIZE( MPI_COMM_WORLD, numprocs, ierr )
c..............................................................
print *, 'Process ', myid, ' of ', numprocs, ' is alive'
C Turn off MPI
call MPI_FINALIZE(ierr)
stop
end
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
The above program can be compiled and run using:
mpif77 print.f
mpirun -np 4 ./a.out
which will invoke four processors in this case. A typical output to the screen will be:
Process 0 of 4 is alive
Process 3 of 4 is alive
Process 2 of 4 is alive
Process 1 of 4 is alive
Note that this should be considered to be four independent programs each being told to print to the screen. The only time that any of the four copies of code have any inclination as to the existence of the remaining three codes is during calls to MPI functions. Therefore, whichever processor `happens' to arrive at the print statement first will print to the screen first. Therefore, the order in which the printings occur will depend on the computer hardware as well as the relative levels of usage by other users on each processor (note that some operating systems seem to delay output to the screen for all but processor `0' until the end of the program). PROGRAM 2:
c**********************************************************
c pi.f - compute pi by integrating f(x) = 4/(1 + x**2)
c Each node:
c 1) receives the number of rectangles used in the approximation.
c 2) calculates the areas of it's rectangles.
c 3) Synchronizes for a global summation.
c Node 0 prints the result.
c Variables:
c pi the calculated result
c n number of points of integration.
c x midpoint of each rectangle's interval
c f function to integrate
c sum,pi area of rectangles
c tmp temporary scratch space for global summation
c i do loop index
c myid processor identification number
c**********************************************************
program main
C The include 'mpif.h' must be included in all MPI programs
include 'mpif.h'
double precision PI25DT
parameter (PI25DT = 3.141592653589793238462643d0)
double precision mypi, pi, h, sum, x, f, a
integer n, myid, numprocs, i, ierr
c function to integrate
f(a) = 4.d0 / (1.d0 + a*a)
C Turn on MPI.................................................
call MPI_INIT( ierr )
C Retrieve the identification number `myid' of each processor
call MPI_COMM_RANK( MPI_COMM_WORLD, myid, ierr )
C Determine the total number of processors `numprocs'
call MPI_COMM_SIZE( MPI_COMM_WORLD, numprocs, ierr )
c..............................................................
print *, 'Process ', myid, ' of ', numprocs, ' is alive'
sizetype = 1
sumtype = 2
10 if ( myid .eq. 0 ) then
write(6,98)
98 format('Enter the number of intervals: (0 quits)')
read(5,99) n
99 format(i10)
endif
C Broadcast the integer `n' from processor 0 to all processors
call MPI_BCAST(n,1,MPI_INTEGER,0,MPI_COMM_WORLD,ierr)
C check for quit signal
if ( n .le. 0 ) goto 30
C calculate the interval size
h = 1.0d0/n
sum = 0.0d0
do 20 i = myid+1, n, numprocs
x = h * (dble(i) - 0.5d0)
sum = sum + f(x)
20 continue
mypi = h * sum
C Collect and sum all the partial sums and bring to processor 0
call MPI_REDUCE(mypi,pi,1,MPI_DOUBLE_PRECISION,MPI_SUM,0,
$ MPI_COMM_WORLD,ierr)
C node 0 prints the answer.
if (myid .eq. 0) then
write(6, 97) pi, abs(pi - PI25DT)
97 format(' pi is approximately: ', F18.16,
c ' Error is: ', F18.16)
endif
goto 10
C Turn off MPI
30 call MPI_FINALIZE(ierr)
stop
end
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
PROGRAM 3: Solution of the one dimensional wave equation using a `forward time, backward space' finite difference scheme on a periodic domain.
program main
C written by Vikas Khatumria....
include 'mpif.h'
C Define the total number of grid points, the number of processors,
C and the actual number of grid points in each subdomain:
PARAMETER (NTOTAL=40,NCPUS=4,N=NTOTAL/NCPUS)
C Each processor only requires arrays of N dimension:
real X(N),F(N),FN(N),X0(N),F0(N)
C MPI stuff:
integer numprocs,ICOMM,nxminus,nxplus,ierr,myid
integer dims(1),icoords(1)
integer status(MPI_STATUS_SIZE)
logical periods(1), reorder
real alpha,pi,time,tfin,q
real delx,delt,pp
c....................................................
C Define the Courant number (2PI*deltat/deltax)
ALPHA=0.3
PI=4.*atan(1.)
TIME=0.
C TFIN is the final time for the simulation
TFIN=1.
C Turn on MPI
call MPI_INIT(ierr)
C Parameters used to define periodic cartesian processor `grid':
dims(1)=NCPUS
periods(1)=.true.
C Allows the hardware to optimize processor arrangement:
reorder=.true.
c..
c.. Find global ranking and number of processors.
c..
call MPI_COMM_RANK(MPI_COMM_WORLD,myid,ierr)
call MPI_COMM_SIZE(MPI_COMM_WORLD,nprocs,ierr)
c..
c.. Hardware optimized processor configurator and new ranking.
c..
call MPI_CART_CREATE(MPI_COMM_WORLD,1,dims,periods,reorder,
c ICOMM,ierr)
call MPI_COMM_RANK(ICOMM,myid,ierr)
c..
c.. Find nearest neighbor ranks.
c..
call MPI_Cart_shift(ICOMM,0,1,nxminus,nxplus,ierr)
call MPI_CART_GET(ICOMM,1,idims,periods,icoords,ierr)
c..
print*, myid,nxminus,nxplus,icoords(1)
c..
DELX=2.0*PI/(float(NTOTAL))
DELT=ALPHA*DELX/(2.*PI)
NSTEPS=nint(TFIN/DELT)
c..
C Initialize the position and initial conditions of grid points
C which are dependent on the processor location:
do 10 i=1,N
x(i)=(float(i-1))*DELX+(float(icoords(1)))*2.*PI/
c (float(NCPUS))
F(i)=sin(X(i))
FN(i)=0.
F0(i)=0.
10 continue
c
c.. Begin main loop:
do 20 j=1,NSTEPS
if(myid.eq.0) print*, ' STEP: ', j
c..
C Send and receive boundary condition values:
pp=F(N)
isend=myid+1000
irecv=nxminus+1000
call MPI_Sendrecv(pp,1,MPI_REAL,nxplus,isend,
c q,1,MPI_REAL,nxminus,irecv,ICOMM,status,ierr)
c..
C Calculate new function values at next time step level:
c..
FN(1)=F(1)-ALPHA*(F(1)-q)
do 30 i=2,N
FN(i)=F(i)-ALPHA*(F(i)-F(i-1))
30 continue
c..
TIME=TIME+DELT
c.. Update the function values:
do 50 i=1,N
F(i)=FN(i)
50 continue
20 continue
c..
c.. Output data for plotting:
c.. First write out the data from the master to the output file
c..
if(myid.eq.0) then
open(unit=10,file='output.dat',status='unknown')
do i=1,N
write(10,*) x(i),f(i)
end do
endif
c..
c.. Barrier to sync CPUs
c..
call MPI_BARRIER(ICOMM,ierr)
c..
c.. Next cycle through each processor who will send data back to master
c..
do 19 nproc=1,NCPUS-1
c..
c.. Send the function values f to scratch f0 on master
c..
isend=nproc+1000
if(myid.eq.nproc) then
call MPI_Send(F,N,MPI_REAL,0,isend,ICOMM,ierr)
endif
if(myid.eq.0) then
call MPI_Recv(F0,N,MPI_REAL,nproc,isend,ICOMM,status,ierr)
endif
c..
isend=nproc+1000
if(myid.eq.nproc) then
call MPI_Send(X,N,MPI_REAL,0,isend,ICOMM,ierr)
endif
if(myid.eq.0) then
call MPI_Recv(X0,N,MPI_REAL,nproc,isend,ICOMM,status,ierr)
endif
c..
if(myid.eq.0) then
do i=1,N
write(10,*) X0(i),f0(i)
enddo
endif
c..
call MPI_BARRIER(ICOMM,ierr)
c..
19 continue
C Terminate MPI:
call MPI_FINALIZE(ierr)
STOP
END
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
III(D): MPI Routines and Arguments A complete listing of all MPI routines (Fortran bindings) and their argument structures can be found on the MPICH home page: http://www-unix.mcs.anl.gov/mpi/www/www3/ III(E): Programming Tips In addition to the implementation of the individual MPI routines within a program, parallel programming results in some `problems' with `normal' programming techniques. Most of these problems result from the fact that no processor has access to the entire memory distribution of the overall] program (this is true even on shared memory computers when MPI is used). Several common parallel programming mistakes, problems and tips are provided in what follows: Common Blocks: MPI and other parallel based programs *do not* like common blocks due to the way memory is handled. Passing arrays via common blocks should be avoided as much as possible. The author uses common blocks only for passing single valued scalars; not for arrays. Keep the number of scalars within each common block very small. Bugs resulting form common blocks can be very difficult to trace, and sometimes only occur on specific computers (the SGI Origin 2000 is particularly `finicky' in the author's experience). Equivalence Statements: The author has no direct experience with using equivalence statements in MPI programs; however, he has been told to absolutely never use equivalence statements for memory allocation related reasons. Passing Through Subroutines: The author passes *all* arrays between subroutines using the subroutine calls. This can result in very long and messy looking call subroutine call statements; however, it is a very safe way to pass information within a parallel program (these calls are simply `pointers' to a location in the memory, and therefore incur no extra computational tolls). Stopping All Processors: Remember, if a `stop' statement is called within an `if(myid.eq.0) then' statement, only processor `0' will be told to stop. All other processes will generally simply `hang' waiting for their next communication statement they come across. Periodic Sendrecv: The author's preferred method of communication, when appropriate, is the combined send and receive command; MPI_SENDRECV. This command is particularly simple to implement when the processors are arranged periodically. For example, the command can tell `everyone' to send to the right, and receive simultaneously from the left. Since all processors are involved it can be used to serve the simultaneous function of the MPI_BARRIER command without the added overhead; thereby keeping the processors in sync simply, and without added cost. Barrier Statements: The use of MPI_BARRIER statements can be extremely useful when first writing a parallel code, since processors can be forced to remain in sync in a very predictable manner. However, the MPI_BARRIER command will also slow down the code dramatically when called often. It is suggested that first versions of codes use the barrier command liberally; after which the barriers should be removed as much as possible, while confirming that no errors result. Compiler Optimization: Use of the -O2 and -O3 compiler optimization flags can result in a substantial increase in code performance. The author's finite difference code runs in some cases more than three times faster when compiled with the -O3 flag; making it more than worth the added compilation time. Not all codes will experience such dramatic gains, nevertheless these optimizations should be tested to determine their effects. This particular tip is relevant to all code writing; not only parallel codes. Code and Subroutine Timing: The MPI_WTIME() can be used within an MPI program to call the clock time [eg. time=MPI_WTIME()]. No arguments appear within the parentheses. This provides a very useful way to check code or subroutine performance. Keep It Simple: This is perhaps the best advice the author can give. As noted above, MPI contains well over one hundred subroutine calls and alot of time could be spent trying `fancy' ways of optimizing a code. However, MPI code can also be extremely difficult to debug, and more often than not, simple is better... |
Home
Publications
Teaching
Research
Team
Tutorial
Beowulf
Quotes