BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:America/Chicago
X-LIC-LOCATION:America/Chicago
BEGIN:DAYLIGHT
TZOFFSETFROM:-0600
TZOFFSETTO:-0500
TZNAME:CDT
DTSTART:19700308T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0500
TZOFFSETTO:-0600
TZNAME:CST
DTSTART:19701101T020000
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20181221T160731Z
LOCATION:D174
DTSTART;TZID=America/Chicago:20181116T103000
DTEND;TZID=America/Chicago:20181116T105000
UID:submissions.supercomputing.org_SC18_sess146_ws_ftxs113@linklings.com
SUMMARY:Analyzing the Impact of System Reliability Events on Applications 
 in the Titan Supercomputer
DESCRIPTION:Workshop\nResiliency, Scientific Computing, Workshop Reg Pass\
 n\nAnalyzing the Impact of System Reliability Events on Applications in th
 e Titan Supercomputer\n\nAshraf, Engelmann\n\nExtreme-scale computing syst
 ems employ Reliability, Availability and Serviceability (RAS) mechanisms a
 nd infrastructure to log events from multiple system components. In this p
 aper, we analyze RAS logs in conjunction with the application placement an
 d scheduling database, in order to understand the impact of common RAS eve
 nts on application performance. This study of the records of about 2 mill
 ion applications executed on the Titan supercomputer provides import
 ant insights for system users, operators and computer science research
 ers. We investigate the impact of RAS events on application perfor
 mance and its
  variability by comparing cases where events are recorded with correspondi
 ng cases where no events are recorded. Such a statistical investigation is
  possible since we observed that system users tend to execute their applic
 ations multiple times. Our analysis reveals that most RAS events do impact
  application performance, although not always. We also find that different
  system components affect application performance differently. In particul
 ar, our investigation includes the following components: parallel file sys
 tem, processor, memory, graphics processing units, system and user softwar
 e issues. Our work establishes the importance of providing feedback to sys
 tem users for increasing operational efficiency of extreme-scale systems.
URL:https://sc18.supercomputing.org/presentation/?id=ws_ftxs113&sess=sess1
 46
END:VEVENT
END:VCALENDAR

