BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:America/Chicago
X-LIC-LOCATION:America/Chicago
BEGIN:DAYLIGHT
TZOFFSETFROM:-0600
TZOFFSETTO:-0500
TZNAME:CDT
DTSTART:19700308T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0500
TZOFFSETTO:-0600
TZNAME:CST
DTSTART:19701101T020000
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20181221T160728Z
LOCATION:D163
DTSTART;TZID=America/Chicago:20181112T153000
DTEND;TZID=America/Chicago:20181112T155500
UID:submissions.supercomputing.org_SC18_sess142_ws_pdsw111@linklings.com
SUMMARY:Characterizing Deep-Learning I/O Workloads in TensorFlow
DESCRIPTION:Workshop\nI/O, Storage, Workshop Reg Pass\n\nCharacterizing De
 ep-Learning I/O Workloads in TensorFlow\n\nChien, Markidis, Sishtla, Santo
 s, Herman...\n\nThe performance of Deep-Learning (DL) computing frameworks
 relies on the performance of data ingestion and checkpointing. During t
 raining, a considerably large number of relatively small files are firs
 t loaded and pre-processed on CPUs and then moved to the accelerator fo
 r computation. In addition, checkpoint and restart operations are carri
 ed out so that DL computing frameworks can restart quickly from a check
 point. Because of this, I/O affects the performance of DL applications. I
 n this work, we characterize the I/O performance and scaling of TensorFl
 ow, an open-source programming framework developed by Google and specif
 ically designed for solving DL problems. To measure TensorFlow I/O perf
 ormance, we first design a micro-benchmark to measure TensorFlow reads, a
 nd then use a TensorFlow mini-application based on AlexNet to measure t
 he performance cost of I/O and checkpointing in TensorFlow. To improve c
 heckpointing performance, we design and implement a burst buffer. We fin
 d that increasing the number of threads increases TensorFlow bandwidth b
 y a maximum of 2.3× and 7.8× on our benchmark environments. The TensorF
 low prefetcher completely overlaps computation on the accelerator with t
 he input pipeline on the CPU, eliminating the effective cost of I/O on o
 verall performance. Using a burst buffer to checkpoint to fast, small-c
 apacity storage and to asynchronously copy checkpoints to slower, large
 -capacity storage improves performance by 2.6× over checkpointing direc
 tly to the slower storage on our benchmark environment.
URL:https://sc18.supercomputing.org/presentation/?id=ws_pdsw111&sess=sess1
 42
END:VEVENT
END:VCALENDAR