Introduction to High Performance Computing
La Laguna, February 12, 2004
Questions
Why Parallel Computers?
How Can the Quality of the Algorithms Be Analyzed?
How Should Parallel Computers Be Programmed?
Why the Message Passing Programming Paradigm?
Why the Shared Memory Programming Paradigm?
OUTLINE
Introduction to Parallel Computing
Performance Metrics
Models of Parallel Computers
The MPI Message Passing Library
Examples
The OpenMP Shared Memory Library
Examples
Improvements in black hole detection using parallelism
Why Parallel Computers?
Applications demanding more computational power:
Artificial Intelligence
Weather Prediction
Biosphere Modeling
Processing of Large Amounts of Data (from sources such as satellites)
Combinatorial Optimization
Image Processing
Neural Networks
Speech Recognition
Natural Language Understanding
etc.
Top500
www.top500.org
Performance Metrics
Speed-up
Ts = Sequential Run Time: Time elapsed between the beginning and the end of its execution on a sequential computer.
Tp = Parallel Run Time: Time that elapses from the moment that a parallel computation starts to the moment that the last processor finishes the execution.
Speed-up = T*s / Tp ≤ p
T*s = Time of the best sequential algorithm to solve the problem.
Speed-up
[Figure: Linear and Actual Speed-up — the linear speed-up line and two actual speed-up curves (Actual Speed-up1, Actual Speed-up2) plotted against the number of processors (0–100); the peak of Actual Speed-up2 marks the optimal number of processors]
Efficiency
In practice, the ideal behavior of a speed-up equal to p is not achieved, because while executing a parallel algorithm the processing elements cannot devote 100% of their time to the computations of the algorithm.
Efficiency: Measure of the fraction of time for which a processing element is usefully employed.
E = (Speed-up / p) x 100 %
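As a quick worked example (numbers chosen for illustration, not taken from the slides): on p = 8 processors, a program with T*s = 100 s and Tp = 25 s gives

  S = T*s / Tp = 100 / 25 = 4,   E = (S / p) x 100 % = (4 / 8) x 100 % = 50 %

so only half of the processors' aggregate time is usefully employed.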
Efficiency
[Figure: Optimum Efficiency and two actual efficiency curves (Actual Efficiency-1, Actual Efficiency-2), efficiency (%) vs. number of processors (0–100)]
Amdahl's Law
Amdahl's law attempts to give a maximum bound for the speed-up from the nature of the algorithm chosen for the parallel implementation.
Seq = Proportion of time the algorithm needs to spend in purely sequential parts.
Par = Proportion of time that might be done in parallel.
Seq + Par = 1 (where 1 is for algebraic simplicity)
Maximum Speed-up = (Seq + Par) / (Seq + Par / p) = 1 / (Seq + Par / p)

For p = 1000:

Seq (%)   Seq     Par     Maximum Speed-up
0.1       0.001   0.999   500.25
0.5       0.005   0.995   166.81
1         0.01    0.99    90.99
10        0.1     0.9     9.91
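The first row of the table can be checked directly against the formula with p = 1000:

  Maximum Speed-up = 1 / (Seq + Par / p) = 1 / (0.001 + 0.999 / 1000) = 1 / 0.001999 ≈ 500.25

Even a 0.1% sequential fraction halves the ideal speed-up of 1000.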
Example
A problem to be solved many times over several different inputs.
Evaluate F(x,y,z)
x in {1, ..., 20}; y in {1, ..., 10}; z in {1, ..., 3}
The total number of evaluations is 20*10*3 = 600.
The cost to evaluate F at one point (x, y, z) is t.
The total running time is t * 600.
If t is equal to 3 hours, the total running time for the 600 evaluations is 1800 hours ≈ 75 days.
Speed-up
[Figure: as before, with a Super-Linear Speedup curve added above the linear speed-up line]
Models
The Sequential Model
The RAM model expresses computations on von Neumann architectures.
The von Neumann architecture is universally accepted for sequential computations.
The Parallel Model
Computational Models
Programming Models
Architectural Models
Address-Space Organization
Digital AlphaServer 8400
Hardware
[Figure: hardware architecture diagram]
SGI Origin 2000
Hardware
[Figure: hardware architecture diagram]
The SGI Origin 3000 Architecture (1/2)
jen50.ciemat.es
160 MIPS R14000 processors at 600 MHz
On 40 nodes with 4 processors each
Data and instruction cache on-chip
Irix Operating System
Hypercubic Network
The SGI Origin 3000 Architecture (2/2)
cc-NUMA memory architecture
1 Gflops peak speed
8 MB external cache
1 GB main memory per processor
1 TB hard disk
Beowulf Computers
[Figure: Beowulf cluster diagram]
Towards Grid Computing...
Source: www.globus.org (updated)
Drawbacks that arise when solving Problems using Parallelism
Parallel programming is more complex than sequential programming.
Results may vary as a consequence of intrinsic non-determinism.
New problems appear: deadlocks, starvation...
It is more difficult to debug parallel programs.
Parallel programs are less portable.
The Message Passing Model
The Message Passing Model
[Figure: processors connected through an Interconnection Network; each processor runs in its own address space and cooperates with the others by sending and receiving messages]
MPI
What Is MPI?
• Message Passing Interface standard
• The first standard and portable message passing library with good
performance
• "Standard" by consensus of MPI Forum participants from over 40
organizations
• Finished and published in May 1994, updated in June 1995
• What does MPI offer?
• Standardization - on many levels
• Portability - to existing and new systems
• Performance - comparable to vendors' proprietary libraries
• Richness - extensive functionality, many quality implementations
A Simple MPI Program

#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[]) {
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello from processor %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}

Compile and run:
$> mpicc -o hello hello.c
$> mpirun -np 4 hello
Hello from processor 2 of 4
Hello from processor 3 of 4
Hello from processor 1 of 4
Hello from processor 0 of 4
A Simple MPI Program

#include <stdio.h>
#include <string.h>
#include "mpi.h"

int main(int argc, char *argv[]) {
    int my_rank, p, source, dest = 0, tag = 0;
    char message[100];
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    if (my_rank != 0) {
        printf("Processor %d of %d\n", my_rank, p);
        sprintf(message, "greetings from process %d!", my_rank);
        MPI_Send(message, strlen(message) + 1, MPI_CHAR,
                 dest, tag, MPI_COMM_WORLD);
    } else {
        printf("processor %d, p = %d\n", my_rank, p);
        for (source = 1; source < p; source++) {
            MPI_Recv(message, 100, MPI_CHAR, source, tag,
                     MPI_COMM_WORLD, &status);
            printf("%s\n", message);
        }
    }
    MPI_Finalize();
    return 0;
}

$> mpirun -np 4 helloms
Processor 2 of 4
Processor 3 of 4
Processor 1 of 4
processor 0, p = 4
greetings from process 1!
greetings from process 2!
greetings from process 3!
Linear Model to Predict Communication Performance
Time to send n bytes = τ n + β
[Figure: measured communication times and least-squares linear fits for two machines — BEOULL: 7E-07 n + 0.0003; CRAYT3E: 5E-08 n + 5E-05 (time in seconds, message size n in bytes)]
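As a rough illustration, the fitted parameters can be plugged into the model to predict transfer times. The following sketch uses the (τ, β) pairs read off the figure above; treat them as illustrative values, not authoritative benchmarks:

#include <stdio.h>

/* Linear communication model: T(n) = tau * n + beta */
static double predict_time(double tau, double beta, long n_bytes) {
    return tau * (double) n_bytes + beta;
}

int main(void) {
    long n = 1 << 20;  /* a 1 MB message */
    printf("BEOULL : %g s\n", predict_time(7e-07, 3e-04, n));
    printf("CRAYT3E: %g s\n", predict_time(5e-08, 5e-05, n));
    return 0;
}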
Performance Prediction: Fast, Gigabit Ethernet, Myrinet
[Figure: measured communication times and linear fits — Fast Ethernet: y = 9E-08x + 0.0007; Gigabit Ethernet: y = 5E-08x + 0.0003; Myrinet: y = 4E-09x + 2E-05; message sizes up to 1000000 bytes]
Basic Communication Operations
One-to-all Broadcast / Single-node Accumulation
[Figure: a one-to-all broadcast sends a message from one source node to all other nodes; single-node accumulation is the dual operation, combining a value from every node onto a single node]
Broadcast on Hypercubes
[Figure: the steps of a one-to-all broadcast on 2-D and 3-D hypercubes, sending along one dimension at a time]
MPI Broadcast
int MPI_Bcast(
    void *buffer,
    int count,
    MPI_Datatype datatype,
    int root,
    MPI_Comm comm
);
Broadcasts a message from the process with rank "root" to all other processes of the group.
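A minimal usage sketch (variable names are illustrative, not from the slides): the root process sets a value and MPI_Bcast makes it available everywhere:

#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[]) {
    int rank;
    long n = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        n = 100000000;  /* only the root knows n before the call */
    MPI_Bcast(&n, 1, MPI_LONG, 0, MPI_COMM_WORLD);
    printf("process %d: n = %ld\n", rank, n);  /* same n everywhere */
    MPI_Finalize();
    return 0;
}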
Reduction on Hypercubes
@ is a commutative and associative operator
Ai is in processor i
Every processor has to obtain A0@A1@...@AP-1
[Figure: the steps of the reduction on a 3-dimensional hypercube, combining partial results pairwise along each dimension]
Reductions with MPI
int MPI_Reduce(
    void *sendbuf,
    void *recvbuf,
    int count,
    MPI_Datatype datatype,
    MPI_Op op,
    int root,
    MPI_Comm comm
);
Combines the elements provided in the send buffer of each process using the operation op, and returns the combined value in the receive buffer of the process with rank root.

int MPI_Allreduce(
    void *sendbuf,
    void *recvbuf,
    int count,
    MPI_Datatype datatype,
    MPI_Op op,
    MPI_Comm comm
);
Same as MPI_Reduce, but the result appears in the receive buffer of all the processes of the group.
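A minimal usage sketch (illustrative, not from the slides): each process contributes one value and the sum lands on the root; replacing MPI_Reduce with MPI_Allreduce (and dropping the root argument) would leave the sum on every process:

#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[]) {
    int rank;
    double local, global;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    local = (double) rank;  /* each process contributes its own rank */
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("sum of ranks = %g\n", global);
    MPI_Finalize();
    return 0;
}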
All-to-All Broadcast / Multinode Accumulation
[Figure: in an all-to-all broadcast every node broadcasts its message to all other nodes; multinode accumulation is the dual operation]
MPI Collective Operations

MPI Operator   Operation
--------------------------------------------
MPI_MAX        maximum
MPI_MIN        minimum
MPI_SUM        sum
MPI_PROD       product
MPI_LAND       logical and
MPI_BAND       bitwise and
MPI_LOR        logical or
MPI_BOR        bitwise or
MPI_LXOR       logical exclusive or
MPI_BXOR       bitwise exclusive or
MPI_MAXLOC     max value and location
MPI_MINLOC     min value and location
Computing π: Sequential

π = ∫_0^1 4 / (1 + x²) dx

double t, pi = 0.0, w;
long i, n = 100000000;
w = 1.0 / n;
for (i = 0; i < n; i++) {
    t = (i + 0.5) * w;
    pi += 4.0 / (1.0 + t*t);
}
pi *= w;
Computing π: Parallel (MPI)

π = ∫_0^1 4 / (1 + x²) dx

MPI_Bcast(&n, 1, MPI_LONG, 0, MPI_COMM_WORLD);
w = 1.0 / n;
local = 0.0;
for (i = myid; i < n; i += numprocs) {
    t = (i + 0.5) * w;
    local += 4.0 / (1.0 + t*t);
}
local *= w;
MPI_Reduce(&local, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
The Master Slave Paradigm
[Figure: a master process dispatching tasks to slave processes]
Condor
University of Wisconsin-Madison. www.cs.wisc.edu/condor
A problem to be solved many times over several different inputs.
The problem to be solved is computationally expensive.
Condor
Condor is a specialized workload management system for compute-intensive jobs.
Like other full-featured batch systems, Condor provides a job queueing
mechanism, scheduling policy, priority scheme, resource monitoring,
and resource management.
Users submit their serial or parallel jobs to Condor, Condor places them
into a queue, chooses when and where to run the jobs based upon a
policy, carefully monitors their progress, and ultimately informs the user
upon completion.
Condor can be used to manage a cluster of dedicated compute nodes
(such as a "Beowulf" cluster). In addition, unique mechanisms enable
Condor to effectively harness wasted CPU power from otherwise idle
desktop workstations.
In many circumstances Condor is able to transparently produce a
checkpoint and migrate a job to a different machine which would
otherwise be idle.
As a result, Condor can be used to seamlessly combine all of an
organization's computational power into one resource.
What is OpenMP?
Application Program Interface (API) for shared
memory parallel programming
What the application programmer inserts into code to make it
run in parallel
Addresses only shared memory multiprocessors
Directive based approach with library support
Concept of base language and extensions to base
language
OpenMP is available for Fortran 90 / 77 and C / C++
OpenMP is not...
Not Automatic parallelization
User explicitly specifies parallel execution
Compiler does not ignore user directives even if wrong
Not just loop level parallelism
Functionality to enable coarse grained parallelism
Not a research project
Only practical constructs that can be implemented with high
performance in commercial compilers
Goal of parallel programming: application speedup
Simple/Minimal with Opportunities for Extensibility
Why OpenMP?
Parallel programming landscape before OpenMP
Standard way to program distributed memory computers
(MPI and PVM)
No standard API for shared memory programming
Several vendors had directive based APIs for shared
memory programming
Silicon Graphics, Cray Research, Kuck & Associates, Digital
Equipment ….
All different, vendor proprietary
Sometimes similar but with different spellings
Most were targeted at loop level parallelism
Limited functionality - mostly for parallelizing loops
Introducción a la Computación de Altas Prestaciones. La Laguna, 12 de febrero de 2004
Why OpenMP? (cont.)
Commercial users, high end software vendors have
big investment in existing code
Not very eager to rewrite their code in new languages
Performance concerns of new languages
End result: users who want portability forced to
program shared memory machines using MPI
Library based, good performance and scalability
But sacrifice the built in shared memory advantages of the
hardware
Both require major investment in time and money
Major effort: entire program needs to be rewritten
New features need to be curtailed during conversion
The OpenMP API
Multi-platform shared-memory parallel programming
OpenMP is portable: supported by Compaq, HP, IBM, Intel,
SGI, Sun and others on Unix and NT
Multi-Language: C, C++, F77, F90
Scalable
Loop Level parallel control
Coarse grained parallel control
The OpenMP API
Single source parallel/serial programming:
OpenMP is not intrusive (to original serial code).
Instructions appear in comment statements for Fortran and
through pragmas for C/C++
!$omp parallel do
do i = 1, n
...
enddo
#pragma omp parallel for
for (i = 0; i < n; i++) {
...
}
Incremental implementation:
OpenMP programs can be implemented incrementally one
subroutine (function) or even one do (for) loop at a time
Threads
Multithreading:
Sharing a single CPU between multiple tasks (or "threads") in
a way designed to minimise the time required to switch
threads
This is accomplished by sharing as much as possible of the
program execution environment between the different threads
so that very little state needs to be saved and restored when
changing thread.
Threads share more of their environment with each other
than do tasks under multitasking.
Threads may be distinguished only by the value of their
program counters and stack pointers while sharing a single
address space and set of global variables
OpenMP Overview: How do threads
interact?
OpenMP is a shared memory model
Threads communicate by sharing variables
Unintended sharing of data causes race conditions:
race condition: when the program’s outcome changes
as the threads are scheduled differently
To control race conditions:
Use synchronizations to protect data conflicts
Synchronization is expensive so:
Change how data is accessed to minimize the need for
synchronization
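A minimal sketch of the point above (not from the slides): updating a shared accumulator inside a parallel loop is a race; a reduction clause gives each thread a private copy and combines them safely at the end:

#include <stdio.h>
#include <omp.h>

int main(void) {
    long i, n = 1000000, sum = 0;

    /* Racy version (kept as a comment): every thread would update
       'sum' without synchronization, so the result varies run to run.
    #pragma omp parallel for
    for (i = 0; i < n; i++) sum += i;
    */

    /* Safe version: the reduction removes the data conflict */
    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < n; i++)
        sum += i;

    printf("sum = %ld\n", sum);  /* always n*(n-1)/2 */
    return 0;
}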
OpenMP Parallel Computing Solution Stack

User layer:                End User, Application
Prog. layer (OpenMP API):  Directives, OpenMP Library, Environment Variables
System layer:              Runtime Library, OS/system support for shared memory
Reasoning about programming
Programming is a process of successive refinement
of a solution relative to a hierarchy of models
The models represent the problem at a different level
of abstraction
The top levels express the problem in the original problem
domain
The lower levels represent the problem in the computer’s
domain
The models are informal, but detailed enough to
support simulation
Source: J.-M. Hoc, T.R.G. Green, R. Samurcay and D.J. Gilmore
(eds.), Psychology of Programming, Academic Press Ltd., 1990
Layers of abstraction in Programming

Domain          Model (bridges between domains)
Problem         Specification
Algorithm       Programming
Source Code     Computational
Computation     Cost
Hardware

OpenMP only defines two of these: the Programming and Computational models!
The OpenMP Computational Model
OpenMP was created with a particular abstract machine or computational model in mind:
Multiple processing elements
A shared address space with "equal-time" access for each processor
Multiple light weight processes (threads) managed outside of OpenMP (the OS or some other "third party")

[Figure: Proc 1, Proc 2, Proc 3, ..., Proc N connected to a Shared Address Space]
The OpenMP programming model
fork-join parallelism:
Master thread spawns a team of threads as needed
Parallelism is added incrementally: i.e. the sequential program evolves into a parallel program

[Figure: a master thread forking teams of threads at each parallel region, including a nested parallel region]
So, How good is OpenMP?
A high quality programming environment supports
transparent mapping between models
OpenMP does this quite well for the models it
defines:
Programming model:
Threads forked by OpenMP map onto threads
in modern OSs.
Computational model:
Multiprocessor systems with cache coherency
map onto OpenMP shared address space
What about the cost model?
OpenMP doesn’t say much about the cost model
programmers are left to their own devices
Real systems have memory hierarchies, OpenMP’s
assumed machine model doesn’t:
Caches mean some data is closer to some processors
Scalable multiprocessor systems organize their RAM into
modules - another source of NUMA
OpenMP programmers must deal with these issues
as they:
Optimize performance on each platform
Scale to run onto larger NUMA machines
What about the specification model?
Programmers reason in terms of a specification
model as they design parallel algorithms
Some parallel algorithms are natural in OpenMP:
Specification models implied by loop-splitting and SPMD
algorithms map well onto OpenMP’s programming model
Some parallel algorithms are hard for OpenMP
Recursive problems and list processing are challenging
for OpenMP's models
Is OpenMP a "good" API?

Model (bridges between domains)   Score
Specification                     Fair (5)
Programming                       Good (8)
Computational                     Good (9)
Cost                              Poor (3)

Overall score: OpenMP is "OK" (6), but it sure could be better!
OpenMP today

Hardware Vendors:
Compaq/Digital (DEC)
Hewlett-Packard (HP)
IBM
Intel
Silicon Graphics
Sun Microsystems

3rd Party Software Vendors:
Absoft
Edinburgh Portable Compilers (EPC)
Kuck & Associates (KAI)
Myrias
Numerical Algorithms Group (NAG)
Portland Group (PGI)
The OpenMP components
Directives
Environment Variables
Shared / Private Variables
Runtime Library
OS ‘Threads’
The OpenMP directives
Directives are special comments in the language
Fortran fixed form: !$OMP, C$OMP, *$OMP
Fortran free form: !$OMP
Special comments are interpreted by OpenMP
compilers
w = 1.0/n
sum = 0.0
!$OMP PARALLEL DO PRIVATE(x) REDUCTION(+:sum)
do I=1,n
x = w*(I-0.5)
sum = sum + f(x)
end do
pi = w*sum
print *,pi
end
The OpenMP directives
Look like a comment (sentinel / pragma syntax)
F77/F90 !$OMP directive_name [clauses]
C/C++ #pragma omp pragmas_name [clauses]
Declare start and end of multithread execution
Control work distribution
Control how data is brought into and taken out of
parallel sections
Control how data is written/read inside sections
The OpenMP environment variables
OMP_NUM_THREADS - number of threads to run in
a parallel section
MPSTKZ – size of stack to provide for each thread
OMP_SCHEDULE - Control state of scheduled
executions.
setenv OMP_SCHEDULE "STATIC, 5"
setenv OMP_SCHEDULE "GUIDED, 8"
setenv OMP_SCHEDULE "DYNAMIC"
OMP_DYNAMIC
OMP_NESTED
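A small sketch of how OMP_SCHEDULE is consumed (illustrative, not from the slides): a loop declared with schedule(runtime) picks up whatever the environment variable specifies, so the scheduling policy can be changed without recompiling:

#include <stdio.h>
#include <omp.h>

int main(void) {
    int i;
    /* scheduling policy taken from OMP_SCHEDULE at run time */
    #pragma omp parallel for schedule(runtime)
    for (i = 0; i < 16; i++)
        printf("iteration %2d run by thread %d\n", i, omp_get_thread_num());
    return 0;
}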
Shared / Private variables
Shared variables can be accessed by all of the
threads.
Private variables are local to each thread
In a ‘typical’ parallel loop, the loop index is private,
while the data being indexed is shared
!$omp parallel do
!$omp& shared(X,Y,Z), private(I)
      do I=1, 100
         Z(I) = X(I) + Y(I)
      end do
!$omp end parallel do
OpenMP Runtime routines
Writing a parallel section of code is matter of asking
two questions
How many threads are working in this section?
Which thread am I?
Other things you may wish to know
How many processors are there?
Am I in a parallel section?
How do I control the number of threads?
Control the execution by setting directive state
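The questions above map directly onto library calls; a minimal sketch (not from the slides):

#include <stdio.h>
#include <omp.h>

int main(void) {
    printf("processors available: %d\n", omp_get_num_procs());
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();   /* which thread am I? */
        int nth = omp_get_num_threads();  /* how many threads here? */
        printf("thread %d of %d (in parallel: %d)\n",
               tid, nth, omp_in_parallel());
    }
    return 0;
}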
OS 'threads'
In the case of Linux, it needs to be installed with an
SMP kernel.
Not a good idea to assign more threads than CPUs
available:
omp_set_num_threads(omp_get_num_procs())
A simple example: computing π

π = ∫_0^1 4 / (1 + x²) dx ≈ (1/n) Σ_{i=0..n-1} 4 / (1 + t_i²), with t_i = (i + 0.5) / n
Computing π

double t, pi = 0.0, w;
long i, n = 100000000;
w = 1.0 / n;
...
for(i = 0; i < n; i++) {
t = (i + 0.5) * w;
pi += 4.0/(1.0 + t*t);
}
pi *= w;
...
Computing π

#pragma omp directives in C
Ignored by non-OpenMP compilers

double t, pi = 0.0, w;
long i, n = 100000000;
w = 1.0 / n;
...
#pragma omp parallel for reduction(+:pi) private(i,t)
for(i = 0; i < n; i++) {
t = (i + 0.5) * w;
pi += 4.0/(1.0 + t*t);
}
pi *= w;
...
Computing π on a SunFire 6800
[Figure: performance results on a SunFire 6800]
Compiling OpenMP programs
OpenMP directives are ignored by default
Example: SGI Irix platforms
f90 -O3 foo.f
cc -O3 foo.c
OpenMP directives are enabled with “-mp”
Example: SGI Irix platforms
f90 -O3 -mp foo.f
cc -O3 -mp foo.c
Fortran example

OpenMP directives used:
c$omp parallel [clauses]
c$omp end parallel

Parallel clauses include:
default(none|private|shared)
private(...)
shared(...)

      program f77_parallel
      implicit none
      integer n, m, i, j
      parameter (n=10, m=20)
      integer a(n,m)
c$omp parallel default(none)
c$omp& private(i,j) shared(a)
      do j=1,m
        do i=1,n
          a(i,j)=i+(j-1)*n
        enddo
      enddo
c$omp end parallel
      end
Fortran example (cont.)
[Figure: the same code, with one arrow per thread — all threads perform the identical task]
Fortran example (cont.)
With default scheduling, the iterations of the j loop are split among four threads:
Thread a works on j = 1:5
Thread b on j = 6:10
Thread c on j = 11:15
Thread d on j = 16:20
The Foundations of OpenMP
OpenMP: a parallel programming API
Parallelism: working with concurrency
Layers of abstractions or models used to understand and use OpenMP
Summary of OpenMP Basics
Parallel Region (to create threads)
C$omp parallel
#pragma omp parallel
Worksharing (to split up work between threads)
C$omp do
#pragma omp for
C$omp sections
#pragma omp sections
C$omp single
#pragma omp single
C$omp workshare
Data Environment (to manage data sharing)
# directive: threadprivate
# clauses: shared, private, lastprivate, reduction, copyin, copyprivate
Synchronization
directives: critical, barrier, atomic, flush, ordered, master
Runtime functions/environment variables
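A compact sketch (illustrative, not from the slides) touching the ingredients listed above: a parallel region, a worksharing loop, data-environment clauses and a synchronization directive:

#include <stdio.h>
#include <omp.h>

int main(void) {
    int i, n = 100;
    double best = -1.0;

    #pragma omp parallel shared(best)       /* parallel region */
    {
        #pragma omp for                     /* worksharing */
        for (i = 0; i < n; i++) {
            double v = (double) i;          /* stand-in for real work */
            #pragma omp critical            /* synchronization */
            if (v > best) best = v;
        }
    }
    printf("best = %g\n", best);
    return 0;
}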
Improvements in black hole detection using parallelism
Introduction (1/3)
Very frequently there is a divorce between computer
scientists and researchers in other scientific
disciplines
This work collects the experiences of a collaboration
between researchers coming from two different fields:
astrophysics and parallel computing
We present different approaches to the parallelization
of a scientific code that solves an important problem
in astrophysics:
the detection of supermassive black holes
Introduction (2/3)
The IAC co-authors initially developed a Fortran77 code solving the problem.
The execution time for this original code was not acceptable, and that motivated them to contact researchers with expertise in the parallel computing field.
We know in advance that these scientific programmers deal with intensely time-consuming sequential codes that are not difficult to tackle using HPC techniques.
Researchers with a purely scientific background are interested in these techniques, but they are not willing to spend time learning about them.
Introduction (3/3)
One of our constraints was to introduce the minimum amount of changes in the original code, even with the knowledge that some optimizations could be done in the sequential code.
Preserving the use of the NAG functions was another restriction in our development.
Outline
The Problem
Black holes and quasars
The method: gravitational lensing
Fluctuations in quasars light curves
Mathematical formulation of the problem
Sequential code
Parallelizations: MPI, OpenMP, Mixed MPI/OpenMP
Computational results
Conclusions
Black holes
Supermassive black holes (SMBH) are supposed to exist in the nuclei of many, if not all, galaxies.
Some of these objects are surrounded by a disk of material continuously spiraling towards the deep gravitational potential pit of the SMBH and releasing huge quantities of energy, giving rise to the phenomena known as quasars (QSO).
The accretion disk
Quasars (Quasi Stellar Objects, QSO)
QSOs are currently believed to be the most luminous
and distant objects in the universe
QSOs are the cores of massive galaxies with super
giant black holes that devour stars at a rapid rate,
enough to produce the amount of energy observed by
a telescope
The method
We are interested in objects of dimensions comparable to the Solar System in galaxies very far away from the Milky Way.
Objects of this size cannot be directly imaged, so alternative observational methods are used to study their structure.
The method we use is the observation of QSO images affected by a microlensing event to study the structure of the accretion disk.
Gravitational Microlensing
Gravitational lensing (the attraction of light by matter) was predicted by General Relativity and observationally confirmed in 1919.
If light from a QSO passes through a galaxy located between the QSO and the observer, it is possible that a star in the intervening galaxy crosses the QSO light beam.
The gravitational field of the star amplifies the light emission coming from the accretion disk (gravitational microlensing).
Microlensing Quasar-Star
[Figure: light paths from the QUASAR to the OBSERVER, showing macroimages and microimages]
Microlensing
The phenomenon is more complex because the magnification is not due to a single isolated microlens; rather, it is a collective effect of many stars.
As the stars are moving with respect to the QSO light beam, the amplification varies during the crossing.
Microlensing
Double Microlens
Multiple Microlens
Magnification pattern in the source plane, produced by a dense field
of stars in the lensing galaxy.
The color reflects the magnification as a function of the quasar
position: the sequence blue-green-red-yellow indicates increasing
magnification
Q2237+0305 (1/2)
So far the best example of a microlensed quasar is
the quadruple quasar Q2237+0305
Q2237+0305 (2/2)
Fluctuations in Q2237+0305 light curves
Light curves of the four images of Q2237+0305 over a period of almost ten years.
The changes in relative brightness are very obvious.
Q2237+0305
In Q2237+0305, and thanks to the unusually small distance between the observer and the lensing galaxy, the microlensing events have a timescale of the order of months.
We observed Q2237+0305 from October 1999 during approximately 4 months at the Roque de los Muchachos Observatory.
Fluctuations in light curves
The curve representing the change in luminosity of
the QSO with time depends on the position of the
star and also on the structure of the accretion disk
Microlens-induced fluctuations in the observed
brightness of the quasar contain information about
the light-emitting source (size of continuum region or
broad line region of the quasar, its brightness profile,
etc.)
Hence from a comparison between observed and
simulated quasar microlensing we can draw
conclusions about the accretion disk
Q2237+0305 light curves (2/2)
Our goal is to model light curves of QSO images
affected by a microlensing event to study the
unresolved structure of the accretion disk
Mathematical formulation (1/5)
Leaving aside the physical meaning of the different variables, the function modeling the dependence of the observed flux with time t can be written as:
[equation not recoverable from the source]
where one of the parameters is the ratio between the outer and inner radii of the accretion disk (we adopt a fixed value for it).
Mathematical formulation (2/5)
And G is the function:
[equation not recoverable from the source]
To speed up the computation, G has been approximated using MATHEMATICA.
Mathematical formulation (3/5)
Our goal is to estimate the values of the parameters (among them B, C and t0) in the observed flux by fitting the model to the observational data.
Mathematical formulation (4/5)
Specifically, to find the values of the 5 parameters that minimize the error between the theoretical model and the observational data according to a chi-square criterion:

χ² = Σ_{i=1..N} [ (F_obs(t_i) − F(t_i)) / σ_i ]²

where:
N is the number of data points, corresponding to times t_i (i = 1, 2, ..., N)
F(t_i) is the theoretical function evaluated at time t_i
σ_i is the observational error associated to each data value
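A direct transcription of the criterion into C (a sketch; the function pointer 'model' stands in for the theoretical function F(t_i), which is not reproduced here):

/* chi-square between observations and a model function */
double chi_square(int N, const double t[], const double obs[],
                  const double sigma[], double (*model)(double)) {
    double chi2 = 0.0;
    for (int i = 0; i < N; i++) {
        double r = (obs[i] - model(t[i])) / sigma[i];  /* weighted residual */
        chi2 += r * r;
    }
    return chi2;
}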
Mathematical formulation (5/5)
To minimize χ² we use the e04ccf NAG routine, which only requires evaluation of the function and not of the derivatives.
The determination of the minimum in the 5-parameter space depends on the initial conditions, so:
we consider a 5-dimensional grid of starting points, with m sampling intervals in each variable
for each one of the points of this grid we compute a local minimum
Finally, we select the absolute minimum among them.
The sequential code
program seq_black_hole
double precision t2(100), s(100), ne(100), fa(100), efa(100)
common/data/t2, fa, efa, length
double precision t, fit(5)
common/var/t, fit
c
c
c
Data input
Initialize best solution
do k1=1, m
do k2=1, m
do k3=1, m
do k4=1, m
do k5=1, m
Initialize starting point x(1), ..., x(5)
call jic2(nfit,x,fx)
call e04ccf(nfit,x,fx,ftol,niw,w1,w2,w3,w4,w5,w6,jic2,...)
if (fx improves best fx) then
update(best (x, fx))
endif
enddo
enddo
enddo
enddo
enddo
end
Sequential Times
In a Sun Blade 100 Workstation running Solaris 5.0
and using the native Fortran77 Sun compiler (v. 5.0)
with full optimizations this code takes
5.89 hours for sampling intervals of size m=4
12.45 hours for sampling intervals of size m=5
In a SGI Origin 3000 using the native MIPSpro F77
compiler (v. 7.4) with full optimizations
0.91 hours for m=4
2.74 hours for m=5
Loop transformation
program seq_black_hole2
implicit none
parameter(m = 4)
...
double precision t2(100), s(100), ne(100), fa(100), efa(100)
common/data/t2, fa, efa, longitud
double precision t, fit(5)
integer k1, k2, k3, k4, k5
common/var/t, fit
c
c
c
c
Data input
Initialize best solution
do k = 1, m^5
Index transformation
Initialize starting point x(1), ..., x(5)
call jic2(nfit,x,fx)
call e04ccf(nfit,x,fx,ftol,niw,w1,w2,w3,w4,w5,w6,jic2,...)
if (fx improves best fx) then
update(best (x, fx))
endif
enddo
end
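The "Index transformation" comment can be made concrete with a small helper (a hypothetical sketch in C, since the slide does not show it): the five nested indices are recovered from the flattened index k as its digits in base m:

/* Recover k1..k5 (each in 0..m-1) from k in 0..m^5-1 */
void index_transform(long k, int m, int idx[5]) {
    for (int d = 0; d < 5; d++) {
        idx[d] = (int)(k % m);  /* digit d of k in base m */
        k /= m;
    }
}

This flattening is what allows the single loop over k to be distributed among processes in the MPI version shown later.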
MPI & OpenMP (1/2)
In the last years OpenMP and MPI have been
universally accepted as the standard tools to develop
parallel applications
OpenMP
is a standard for shared memory programming
It uses a fork-join model and
is mainly based on compiler directives that are added to the code
that indicate the compiler regions of code to be executed in parallel
MPI
Uses an SPMD model
Processes can read and write only to their respective local memory
Data are copied across local memories using subroutine calls
The MPI standard defines the set of functions and procedures
available to the programmer
MPI & OpenMP (2/2)
Each one of these two alternatives has both advantages and disadvantages, and very frequently it is not obvious which one should be selected for a specific code:
MPI programs run on both distributed and shared memory architectures, while OpenMP runs only on shared memory
The abstraction level is higher in OpenMP
MPI is particularly adaptable to coarse grain parallelism; OpenMP is suitable for both coarse and fine grain parallelism
While it is easy to obtain a parallel version of a sequential code in OpenMP, MPI usually requires a higher level of expertise. A parallel code can be written in OpenMP just by inserting proper directives in a sequential code, preserving its semantics, while in MPI major changes are usually needed
Portability is higher in MPI
MPI-OpenMP hybridization (1/2)
Hybrid codes match the architecture of SMP clusters
SMP clusters are increasingly popular platforms
MPI may suffer from efficiency problems in shared memory architectures
MPI codes may need too much memory
Some vendors have attempted to extend MPI in shared memory, but the result is not as efficient as OpenMP
OpenMP is easy to use, but it is limited to the shared memory architecture
MPI-OpenMP hybridization (2/2)
A hybrid code may provide better scalability, or simply enable a problem to exploit more processors
It is not necessarily faster than pure MPI/OpenMP: it depends on the code, the architecture, and how the programming models interact
MPI code
program black_hole_mpi
include 'mpif.h'
implicit none
parameter(m = 4)
...
double precision t2(100), s(100), ne(100), fa(100), efa(100)
common/data/t2, fa, efa, longitud
double precision t, fit(5)
common/var/t, fit
c
c
c
call MPI_INIT( ierr )
call MPI_COMM_RANK( MPI_COMM_WORLD, myid, ierr )
call MPI_COMM_SIZE( MPI_COMM_WORLD, numprocs, ierr )
Data input
Initialize best solution
do k = myid, m^5 - 1, numprocs
Initialize starting point x(1), ..., x(5)
call jic2(nfit,x,fx)
call e04ccf(nfit,x,fx,ftol,niw,w1,w2,w3,w4,w5,w6,jic2,...)
if (fx improves best fx) then
update(best (x, fx))
endif
enddo
call MPI_FINALIZE( ierr )
end
OpenMP code
program black_hole_omp
implicit none
parameter(m = 4)
...
double precision t2(100), s(100), ne(100), fa(100), efa(100)
common/data/t2, fa, efa, longitud
double precision t, fit(5)
common/var/t, fit
!$OMP THREADPRIVATE(/var/, /data/)
c
Data input
c
Initialize best solution
!$OMP PARALLEL DO DEFAULT(SHARED)
PRIVATE(tid,k,maxcal,ftol,ifail,w1,w2,w3,w4,w5,w6,x)
COPYIN(/data/) LASTPRIVATE(fx)
do k = 0, m^5 - 1
c
Initialize starting point x(1), ..., x(5)
call jic2(nfit,x,fx)
call e04ccf(nfit,x,fx,ftol,niw,w1,w2,w3,w4,w5,w6,jic2,...)
if (fx improves best fx) then
update(best (x, fx))
endif
enddo
!$OMP END PARALLEL DO
end
Hybrid MPI – OpenMP code
program black_hole_mpi_omp
double precision t2(100), s(100), ne(100), fa(100), efa(100)
common/data/t2, fa, efa, length
double precision t, fit(5)
common/var/t, fit
!$OMP THREADPRIVATE(/var/, /data/)
call MPI_INIT(ierr)
call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
call MPI_COMM_SIZE(MPI_COMM_WORLD, mpi_numprocs, ierr)
c
Data input
c
Initialize best solution
!$OMP PARALLEL DO DEFAULT(SHARED)
PRIVATE(tid,k,maxcal,ftol,ifail,w1,w2,w3,w4,w5,w6,x)
COPYIN(/data/)LASTPRIVATE(fx)
do k = myid, m^5 - 1, mpi_numprocs
c
Initialize starting point x(1), ..., x(5)
call jic2(nfit,x,fx)
call e04ccf(nfit,x,fx,ftol,niw,w1,w2,w3,w4,w5,w6,jic2,...)
if (fx improves best fx) then
update(best (x, fx))
endif
enddo
c
Reduce the OpenMP best solution
!$OMP END PARALLEL DO
c
Reduce the MPI best solution
call MPI_FINALIZE(ierr)
end
Computational results (m=4)

            Time (secs.)                   Speedup
Procs   MPI      OMP      MPI-OMP     MPI    OMP    MPI-OMP
2       1674.9   1670.5   1671.6      2.0    2.0    2.0
4       951.1    930.6    847.4       3.4    3.5    3.9
8       496.1    485.7    481.9       6.6    6.7    6.8
16      257.0    257.2    255.3       12.7   12.7   12.8
32      133.9    134.0    133.7       24.4   24.4   24.4

As we do not have exclusive-mode access to the architecture, the times correspond to the minimum time from 5 different executions.
The figures corresponding to the mixed mode code (label MPI-OpenMP) correspond to the minimum times obtained for different combinations of MPI processes/OpenMP threads.
Parallel execution time (m=4)
[Figure: execution time (secs.) vs. number of processors (up to 32) for MPI, OpenMP and MPI-OpenMP]
Speedup (m=4)
[Figure: speedup vs. number of processors (up to 32) for MPI, OpenMP and MPI-OpenMP]
Results for mixed MPI-OpenMP (m=4)
[Figure: execution time (sec.) for different combinations of MPI processes and OpenMP threads (1, 2, 4, 8, 16, 32 threads)]
Computational results (m=5)

            Time (secs.)                   Speedup
Procs   MPI      OMP      MPI-OMP     MPI    OMP    MPI-OMP
2       4957.0   5085.9   4973.1      2.0    1.9    2.0
4       2504.9   2614.5   2513.0      3.9    3.8    3.9
8       1261.2   1372.4   1265.1      7.8    7.2    7.8
16      640.3    755.8    642.9       15.4   13.1   15.4
32      338.1    400.1    339.2       29.2   24.7   29.1
Parallel execution time (m=5)
[Figure: execution time (secs.) vs. number of processors (up to 32) for MPI, OpenMP and MPI-OpenMP]
Speedup (m=5)
[Figure: speedup vs. number of processors (up to 32) for MPI, OpenMP and MPI-OpenMP]
Results for mixed MPI-OpenMP (m=5)
[Figure: execution time (secs.) for different combinations of MPI processes and OpenMP threads (1, 2, 4, 8, 16, 32 threads)]
Conclusions (1/3)
The computational results obtained from all the
parallel versions confirm the robustness of the
method
For the case of non-expert users and the kind of
codes we have been dealing with, we believe that
MPI parallel versions are easier and safer
In the case of OpenMP, the proper usage of data
scope attributes for the variables involved may be a
handicap for users with non-parallel programming
expertise
The higher current portability of the MPI version is
another factor to be considered
Conclusions (2/3)
The mixed MPI/OpenMP parallel version is the most
expertise-demanding
Nevertheless, as it has been stated by several
authors and even for the case of a hybrid
architecture, this version does not offer any clear
advantage and it has the disadvantage that the
combination of processes/threads has to be tuned
Conclusions (3/3)
We conclude a first step of cooperation in the way of applying HPC techniques to improve performance in astrophysics codes.
The scientific aim of applying HPC to computationally-intensive codes in astrophysics has been successfully achieved.
Conclusions and Future Work
The relevance of our results does not come directly from the particular application chosen, but from showing that parallel computing techniques are the key to tackling large-scale real problems in the mentioned scientific field.
From now on we plan to continue this fruitful collaboration by applying parallel computing techniques to some other astrophysical challenge problems.
Supercomputing Centers
http://www.ciemat.es/
http://www.cepba.upc.es/
http://www.epcc.ed.ac.uk/
MPI links
Message Passing Interface Forum
http://www.mpi-forum.org/
MPI: The Complete Reference
http://www.netlib.org/utk/papers/mpibook/mpi-book.html
Parallel Programming with MPI By Peter
Pacheco. http://www.cs.usfca.edu/mpi/
OpenMP links
http://www.openmp.org/
http://www.compunity.org/