BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:America/Chicago
X-LIC-LOCATION:America/Chicago
BEGIN:DAYLIGHT
TZOFFSETFROM:-0600
TZOFFSETTO:-0500
TZNAME:CDT
DTSTART:19700308T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0500
TZOFFSETTO:-0600
TZNAME:CST
DTSTART:19701101T020000
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20181221T160731Z
LOCATION:D174
DTSTART;TZID=America/Chicago:20181116T113000
DTEND;TZID=America/Chicago:20181116T115000
UID:submissions.supercomputing.org_SC18_sess146_ws_ftxs117@linklings.com
SUMMARY:SaNSA - the Supercomputer and Node State Architecture
DESCRIPTION:Workshop\nResiliency\, Scientific Computing\, Workshop Reg
  Pass\n\nSaNSA - the Supercomputer and Node State Architecture\n\n
 Agarwal\, Greenberg\, Blanchard\, DeBardeleben\n\nIn this work\, we
  present SaNSA\, the Supercomputer and Node State Architecture\, a
  software infrastructure for historical analysis and anomaly detection.
  SaNSA consumes data from multiple sources including system logs\, the
  resource manager\, scheduler\, and job logs. Furthermore\, additional
  context such as scheduled maintenance events or dedicated application
  run times for specific science teams can be overlaid. We discuss how
  this contextual information allows for more nuanced analysis. SaNSA
  allows the user to apply arbitrary attributes\, for instance\,
  positional information where nodes are located in a data center. We
  show how using this information we identify anomalous behavior of one
  rack of a 1\,500 node cluster. We explain the design of SaNSA and then
  test it on four open compute clusters at LANL. We ingest over 1.1
  billion lines of system logs in our study of 190 days in 2018. Using
  SaNSA\, we perform a number of different anomaly detection methods and
  explain their findings in the context of a production supercomputing
  data center. For example\, we report on instances of misconfigured
  nodes which receive no scheduled jobs for a period of time as well as
  examples of correlated rack failures which cause jobs to crash.
URL:https://sc18.supercomputing.org/presentation/?id=ws_ftxs117&sess=sess1
 46
END:VEVENT
END:VCALENDAR

