Parallel implementation
Next: References
Up: Parallel Computation of Spectral
Previous: An example in
The spectral portrait is defined by the values of
computed at each
point z of the discretized region.
All these computations, that are the most time consuming part of the code,
are independent and constitute a pool of independent tasks that can be performed
simultaneously. To manage this embarrassing parallelism,
a parallel manager module has been
designed for an implementation on a network of heterogeneous computers.
The main features of this parallel manager are:
- Openness and maintainability: The code can
be easily maintained and upgraded. The computation module is kept
separate from the parallel management module. Adding more functionalities
in the computational module should not interfere strongly with the parallelization.
- Similarity with a sequential code to run: Running code on parallel
environment should be equivalent to run it on a serial machine.
- Portability: The resulting parallel program is as little ``machine dependent"
as possible. It is implemented by using a portable parallel strategy
and the portable message passing library PVM.
- Load balancing: The parallel target platforms
are mainly networks of heterogeneous workstations.
The parallel management module takes into account the load unbalance that may
originate either from the variability in the load and in the computational
power of the various nodes, or in the computation required for each task (i.e.
we use an iterative scheme to compute
and the number of iterations depends
on the difficulties encountered to solve the problem).
- Fault tolerances: on a network of workstations one computer may breakdown.
The parallel management module permits to detect the
breakdowns and recover them, in such a way that the parallel simulation does not
breakdown itself. Fault tolerance features are integrated in our implementation.
- Scalability : the parallel manager implements a coarse grain parallelism,
that ensures scalability of the parallel code.
A task consists in the treatment of a subset of points.
Therefore, the granularity is controlled through the number of points per task
(which is set by the user and constant for a complete run).
The parallel implementation is based on a master/slave scheme.
- The master starts a slave process on each computing node and broadcasts
to the slaves the information common to all the computation, mainly the matrix
in sparse format. Then the master sends one point (i.e. task) to each of the slave.
As soon as a slave has finished its computation, it send back the result to the master
and receives another task is there is one left. The tasks are removed from the queue
only when the results associated with them have been received from one slave.
During the computation the master can detect the breakdown of either a node
or a slave process located on a node. If a process breaks down, the
master keeps in the queue the data that were handled by this process and
spawns another process on another node.
- The slave receives a task (i.e. a set of points), performs the computation
of
, send the results back to the master and receives another task if there
is one left.
The parallel experiments reported here have been performed
on the Electromagnetism matrix.
For this test case the matrix is of size 288x288 and the complex
region of interest is discretized by a 64x64 grid.
We report some performance observed on the Meiko CS2 machine located at CERFACS
(16 nodes, each node has two Sparc processors sharing
128 Mb of memory).
We vary the granularity of the tasks from one point to maximum size,
which is the size of the mesh divided by the number of processors
(i.e. only one task per processor).
Table 1 (Resp. Table 2) shows the speed-up as a
function of the number of nodes,
when two processes per node are used (Resp. one process per node).
Table 1: Speed-up for different numbers of nodes (with 2 processes per node).
Meiko CS-2
Table 2: Speed-up for different numbers of nodes (with 1 process per node).
Meiko CS-2
Table 3:
Speed up for different numbers of workstations and size of the tasks. IBM
network
Thanks to the fast communication network of the Meiko CS-2, the best
performances are always observed when the task consists only of one point.
As it is shown in Table 3, this is no longer true on a network of
IBM workstations connected via Ethernet. In this case the ratio
communication/computation is poorly balanced and the optimal size of the task
depends on the number of workstations involved in the computation.
The optimal corresponds to the size
which provides us with the best trade-off between the potential load unbalance
and the cost of communication (mainly the network latency)
associated with the allocation of the tasks.
Finally, we emphasize the scalability of the parallel application.
On 24 processors we obtain a speed-up close to 21.
Next: References
Up: Parallel Computation of Spectral
Previous: An example in
Contact: toumazou@cerfacs.fr
Last Update: