Parallel implementation




The spectral portrait is defined by the values computed at each point z of the discretized region. These computations, which are the most time-consuming part of the code, are mutually independent and form a pool of tasks that can be performed simultaneously. To exploit this embarrassing parallelism, a parallel manager module has been designed for a network of heterogeneous computers. The main features of this parallel manager are:

The parallel implementation is based on a master/slave scheme.

The parallel experiments reported here were performed on the Electromagnetism matrix. For this test case the matrix is of size 288 x 288 and the complex region of interest is discretized by a 64 x 64 grid. We report the performance observed on the Meiko CS-2 machine located at CERFACS (16 nodes, each with two Sparc processors sharing 128 MB of memory). We vary the granularity of the tasks from a single point up to the maximum size, which is the size of the mesh divided by the number of processors (i.e. only one task per processor). Table 1 (resp. Table 2) shows the speed-up as a function of the number of nodes when two processes per node are used (resp. one process per node).
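The task pool and its granularity can be illustrated with a minimal sketch. It is not the parallel manager itself: threads stand in for the networked processes, `point_value` is a hypothetical placeholder for the expensive per-point computation, and the square region of the complex plane is an arbitrary choice. Only the 64 x 64 grid size is taken from the text.

```python
from concurrent.futures import ThreadPoolExecutor


def make_tasks(n_points, task_size):
    """Split point indices 0..n_points-1 into contiguous tasks of at most task_size points."""
    return [range(start, min(start + task_size, n_points))
            for start in range(0, n_points, task_size)]


def point_value(z):
    # Hypothetical stand-in for the expensive computation done at each grid point.
    return abs(z)


def run_master(points, task_size, n_workers):
    """Master: hand tasks from the pool to the workers and gather results by point index."""
    results = [None] * len(points)

    def worker(task):
        # A worker processes every point of its task and returns (index, value) pairs.
        return [(i, point_value(points[i])) for i in task]

    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        for chunk in pool.map(worker, make_tasks(len(points), task_size)):
            for i, value in chunk:
                results[i] = value
    return results


# 64 x 64 grid over an assumed square region [-1, 1] x [-1, 1] of the complex plane.
grid = [complex(-1 + 2 * i / 63, -1 + 2 * j / 63)
        for i in range(64) for j in range(64)]
values = run_master(grid, task_size=64, n_workers=4)
```

Setting `task_size=1` gives the finest granularity (one point per task), while `task_size=len(grid) // n_workers` gives one task per worker, the two extremes considered in the experiments.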

  
Table 1: Speed-up for different numbers of nodes (with 2 processes per node). Meiko CS-2

  
Table 2: Speed-up for different numbers of nodes (with 1 process per node). Meiko CS-2

 
Table 3: Speed-up for different numbers of workstations and task sizes. IBM network

Thanks to the fast communication network of the Meiko CS-2, the best performance is always observed when each task consists of a single point. As shown in Table 3, this is no longer true on a network of IBM workstations connected via Ethernet. In this case the communication/computation ratio is poorly balanced, and the optimal task size depends on the number of workstations involved in the computation. The optimal size is the one that gives the best trade-off between the potential load imbalance and the cost of communication (mainly the network latency) associated with the allocation of the tasks.
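This trade-off can be explored with a toy simulation. It is an illustrative model, not a reproduction of the measurements: each task is assumed to cost a fixed `latency` plus the sum of its point costs, idle workers always take the next task in the pool, and the per-point costs and candidate sizes are made up.

```python
import heapq


def makespan(point_costs, task_size, n_workers, latency):
    """Simulated finish time when each idle worker is handed the next task in the pool."""
    free_at = [0.0] * n_workers          # time at which each worker becomes idle
    heapq.heapify(free_at)
    for start in range(0, len(point_costs), task_size):
        work = sum(point_costs[start:start + task_size])
        t = heapq.heappop(free_at)       # the earliest idle worker takes the task
        heapq.heappush(free_at, t + latency + work)
    return max(free_at)


# Hypothetical, uneven per-point costs in arbitrary time units (64 x 64 = 4096 points).
costs = [1.0 + (i % 7) * 0.5 for i in range(4096)]


def best_task_size(latency, n_workers=8, sizes=(1, 4, 16, 64, 256, 512)):
    """Task size among the candidates with the smallest simulated finish time."""
    return min(sizes, key=lambda s: makespan(costs, s, n_workers, latency))
```

In this model a large per-task latency favours large tasks, since fewer allocations are paid for, while a negligible latency lets small tasks even out the load across workers.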

Finally, we emphasize the scalability of the parallel application. On 24 processors we obtain a speed-up close to 21.
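To put this figure in perspective, the parallel efficiency, taken here in the usual sense of speed-up divided by the number of processors (a definition assumed for illustration, not stated in the text), is

\[
E = \frac{S_p}{p} \approx \frac{21}{24} = 0.875 ,
\]

that is, a little under 90% of ideal linear speed-up.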




Contact: toumazou@cerfacs.fr