BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:America/Chicago
X-LIC-LOCATION:America/Chicago
BEGIN:DAYLIGHT
TZOFFSETFROM:-0600
TZOFFSETTO:-0500
TZNAME:CDT
DTSTART:19700308T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0500
TZOFFSETTO:-0600
TZNAME:CST
DTSTART:19701101T020000
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20181221T160728Z
LOCATION:D220
DTSTART;TZID=America/Chicago:20181112T143000
DTEND;TZID=America/Chicago:20181112T150000
UID:submissions.supercomputing.org_SC18_sess172_ws_phpsc104@linklings.com
SUMMARY:Performance, Power, and Scalability Analysis of the Horovod Implem
 entation of the CANDLE NT3 Benchmark on the Cray XC40 Theta
DESCRIPTION:Workshop\nParallel Application Frameworks, Reproducibility, Sc
 ientific Computing, Workshop Reg Pass\n\nPerformance, Power, and Scalabili
 ty Analysis of the Horovod Implementation of the CANDLE NT3 Benchmark on t
 he Cray XC40 Theta\n\nWu, Taylor, Wozniak, Stevens, Brettin...\n\nTraining
  scientific deep learning models requires the large amount of computing po
 wer provided by HPC systems. In this paper, we use the distributed deep le
 arning framework Horovod to parallelize NT3, a Python benchmark from the e
 xploratory research project CANDLE (Cancer Distributed Learning Environmen
 t). We analyze NT3's scalability, performance, and power characteristics w
 ith different batch sizes and learning rates under two memory modes, cache
  and flat, on the DOE pre-exascale production system Cray XC40 Theta at Ar
 gonne National Laboratory. Our experimental results indicate that the powe
 r profiles for the node, CPU, and memory are useful in showing how the Ho
 rovod NT3 benchmark behaves on the underlying system. Using the communicat
 ion timeline of this benchmark, we found that the Horovod communication ov
 erhead in NT3 increases significantly with the number of nodes although Ho
 rovod has the ability to scale up.\n\nThe benchmark leads to smaller runti
 me and lower power consumption for the node and CPU under the cache mode t
 han under the flat mode. Furthermore, increasing the batch size leads to a
  runtime decrease and slightly impacts the power. Increasing the learning 
 rate results in a slight decrease in runtime and node power and an increa
 se in accuracy. Several issues raised by the Horovod NT3 benchmark results
  are discussed, and suggestions are proposed for further work.
URL:https://sc18.supercomputing.org/presentation/?id=ws_phpsc104&sess=sess
 172
END:VEVENT
END:VCALENDAR

