BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:America/Chicago
X-LIC-LOCATION:America/Chicago
BEGIN:DAYLIGHT
TZOFFSETFROM:-0600
TZOFFSETTO:-0500
TZNAME:CDT
DTSTART:19700308T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0500
TZOFFSETTO:-0600
TZNAME:CST
DTSTART:19701101T020000
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20181221T160728Z
LOCATION:D220
DTSTART;TZID=America/Chicago:20181112T160000
DTEND;TZID=America/Chicago:20181112T163000
UID:submissions.supercomputing.org_SC18_sess172_ws_phpsc106@linklings.com
SUMMARY:Balsam: Automated Scheduling and Execution of Dynamic, Data-Intens
 ive HPC Workflows
DESCRIPTION:Workshop\nParallel Application Frameworks, Reproducibility, Sc
 ientific Computing, Workshop Reg Pass\n\nBalsam: Automated Scheduling and 
 Execution of Dynamic, Data-Intensive HPC Workflows\n\nSalim, Uram, Childer
 s, Vishwanath, Papka...\n\nWe introduce the Balsam service to manage high-
 throughput task scheduling and execution on supercomputing systems. Balsam
  allows users to populate a task database with a variety of tasks ranging 
 from simple independent tasks to dynamic multi-task workflows. With abstra
 ctions for the local resource scheduler and MPI environment, Balsam dynami
 cally packages tasks into ensemble jobs and manages their scheduling lifec
 ycle. The ensembles execute in a pilot "launcher" which (i) ensures concu
 rrent, load-balanced execution of arbitrary serial and parallel programs w
 ith heterogeneous processor requirements, (ii) requires no modification of
  user applications, (iii) is tolerant of task-level faults and provides se
 veral options for error recovery, (iv) stores provenance data (e.g., task hi
 story, error logs) in the database, (v) supports dynamic workflows, in whi
 ch tasks are created or killed at runtime. Here, we present the design and
  Python implementation of the Balsam service and launcher. The efficacy of
  this system is illustrated using two case studies: hyperparameter optimiz
 ation of deep neural networks, and high-throughput single-point quantum ch
 emistry calculations. We find that the unique combination of flexible job-
 packing and automated scheduling with dynamic (pilot-managed) execution fa
 cilitates excellent resource utilization. The scripting overheads typicall
 y needed to manage resources and launch workflows on supercomputers are su
 bstantially reduced, accelerating workflow development and execution.
URL:https://sc18.supercomputing.org/presentation/?id=ws_phpsc106&sess=sess
 172
END:VEVENT
END:VCALENDAR