BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:America/Chicago
X-LIC-LOCATION:America/Chicago
BEGIN:DAYLIGHT
TZOFFSETFROM:-0600
TZOFFSETTO:-0500
TZNAME:CDT
DTSTART:19700308T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0500
TZOFFSETTO:-0600
TZNAME:CST
DTSTART:19701101T020000
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20181221T160731Z
LOCATION:D171/173
DTSTART;TZID=America/Chicago:20181116T111500
DTEND;TZID=America/Chicago:20181116T113000
UID:submissions.supercomputing.org_SC18_sess145_ws_p3hpc112@linklings.com
SUMMARY:Heterogeneous CPU-GPU Execution of Stencil Applications
DESCRIPTION:Workshop\nHeterogeneous Systems\, Performance\, Workshop Reg
  Pass\n\nHeterogeneous CPU-GPU Execution of Stencil Applications\n\n
 Siklosi\, Reguly\, Mudalige\n\nHeterogeneous computer architectures are
  now ubiquitous in high performance computing\; the top 7 supercomputers
  are all built with CPUs and accelerators. Portability across different
  CPUs and GPUs is becoming paramount\, and heterogeneous scheduling of
  computations is also of increasing interest to make full use of these
  systems. In this paper\, we present research on the hybrid CPU-GPU
  execution of an important class of applications: structured mesh
  stencil codes. Our work broadens the performance portability
  capabilities of the Oxford Parallel library for Structured meshes
  (OPS)\, which allows a science code written once at a high level to be
  automatically parallelised for a range of different architectures. We
  explore the traditional per-loop load balancing approach used by
  others\, and highlighting its shortcomings\, we develop an algorithm
  that relies on polyhedral analysis and transformations in OPS to allow
  load balancing on the level of larger computational stages\, reducing
  data transfer requirements and synchronisation points.\n\nWe evaluate
  our algorithms on a simple heat equation benchmark\, as well as a
  substantially more complex code\, the CloverLeaf hydrodynamics
  mini-app. To demonstrate performance portability\, we study Intel and
  IBM systems equipped with NVIDIA Kepler\, Pascal\, and Volta GPUs\,
  evaluating CPU-only\, GPU-only and hybrid CPU-GPU performance. We
  demonstrate a 1.05-1.2x speedup on CloverLeaf. Our results highlight
  the ability of the OPS domain specific language to deliver effortless
  performance portability for its users across a number of platforms.
URL:https://sc18.supercomputing.org/presentation/?id=ws_p3hpc112&sess=sess
 145
END:VEVENT
END:VCALENDAR

