BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:America/Chicago
X-LIC-LOCATION:America/Chicago
BEGIN:DAYLIGHT
TZOFFSETFROM:-0600
TZOFFSETTO:-0500
TZNAME:CDT
DTSTART:19700308T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0500
TZOFFSETTO:-0600
TZNAME:CST
DTSTART:19701101T020000
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20181221T160731Z
LOCATION:C140/142
DTSTART;TZID=America/Chicago:20181115T143000
DTEND;TZID=America/Chicago:20181115T150000
UID:submissions.supercomputing.org_SC18_sess190_pap322@linklings.com
SUMMARY:Anatomy of High-Performance Deep Learning Convolutions on SIMD Arc
 hitectures
DESCRIPTION:Paper\nApplications, Cosmology, Data Analytics, Deep Learning,
  Machine Learning, Programming Systems, Storage, Visualization, Tech Progr
 am Reg Pass\n\nAnatomy of High-Performance Deep Learning Convolutions on S
 IMD Architectures\n\nGeorganas, Avancha, Banerjee, Kalamkar, Henry...\n\nC
 onvolution layers are prevalent in many classes of deep neural networks, i
 ncluding Convolutional Neural Networks (CNNs), which provide state-of-the-
 art results for tasks like image recognition, neural machine translation,
  and speech recognition. The computationally expensive nature of the conv
 olution operation has led to a proliferation of implementations, includin
 g matrix-matrix multiplication formulations and direct convolutions, prim
 arily targeting GPUs. In this paper, we introduce direct convolution kern
 els for x
 86 architectures, in particular for Xeon and Xeon Phi systems, which are i
 mplemented via a dynamic compilation approach. Our JIT-based implementatio
 n shows close to theoretical peak performance, depending on the setting an
 d the CPU architecture at hand. We additionally demonstrate how these JIT-
 optimized kernels can be integrated into a lightweight multi-node graph e
 xecution model. This illustrates that single- and multi-node runs yield h
 igh efficiencies and high image throughputs when executing state-of-the-a
 rt image recognition tasks on CPUs.
URL:https://sc18.supercomputing.org/presentation/?id=pap322&sess=sess190
END:VEVENT
END:VCALENDAR