BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:America/Chicago
X-LIC-LOCATION:America/Chicago
BEGIN:DAYLIGHT
TZOFFSETFROM:-0600
TZOFFSETTO:-0500
TZNAME:CDT
DTSTART:19700308T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0500
TZOFFSETTO:-0600
TZNAME:CST
DTSTART:19701101T020000
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20181221T160727Z
LOCATION:D172
DTSTART;TZID=America/Chicago:20181112T103000
DTEND;TZID=America/Chicago:20181112T105500
UID:submissions.supercomputing.org_SC18_sess168_ws_ia108@linklings.com
SUMMARY:Software Prefetching for Unstructured Mesh Applications
DESCRIPTION:Workshop\nArchitectures, Data Analytics, Graph Algorithms, Wor
 kshop Reg Pass\n\nSoftware Prefetching for Unstructured Mesh Applications\
 n\nHadade, Jones, Wang, di Mare\n\nApplications that exhibit regular memor
 y access patterns usually benefit transparently from hardware prefetchers 
 that bring data into the fast on-chip cache just before it is required, th
 ereby avoiding expensive cache misses. In contrast, unstructured mesh appl
 ications contain irregular access patterns that are often more difficult t
 o identify in hardware. An alternative for such workloads is software pref
 etching, where special non-blocking instructions load data into the cache 
 hierarchy. However, there are currently few examples in the literature on 
 how to incorporate such software prefetches into existing applications wit
 h positive results.\n\nThis paper addresses these issues by demonstrating
  the utility and implementation of software prefetching in an
  unstructured finite volume CFD code whose size and complexity are
  representative of an industrial application and across a number of
  processors. We present the benefits
  of auto-tuning for finding the optimal prefetch distance values across di
 fferent computational kernels and architectures and demonstrate the import
 ance of choosing the right prefetch destination across the available cache
  levels for best performance. We discuss the impact of the data layout on 
 the number of prefetch instructions required in kernels with indirect-acce
 ss patterns and show how to integrate them on top of existing optimization
 s such as vectorization. Through this we show significant full application
  speed-ups on a range of processors, such as the Intel Xeon Skylake CPU (1
 5%) as well as on the in-order Intel Xeon Phi Knights Corner (1.99X) archi
 tecture and the out-of-order Knights Landing (33%) many-core processor.
URL:https://sc18.supercomputing.org/presentation/?id=ws_ia108&sess=sess168
END:VEVENT
END:VCALENDAR