BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:America/Chicago
X-LIC-LOCATION:America/Chicago
BEGIN:DAYLIGHT
TZOFFSETFROM:-0600
TZOFFSETTO:-0500
TZNAME:CDT
DTSTART:19700308T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0500
TZOFFSETTO:-0600
TZNAME:CST
DTSTART:19701101T020000
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20181221T160727Z
LOCATION:D172
DTSTART;TZID=America/Chicago:20181112T105500
DTEND;TZID=America/Chicago:20181112T112000
UID:submissions.supercomputing.org_SC18_sess168_ws_ia101@linklings.com
SUMMARY:A Fast and Simple Approach to Merge and Merge Sorting Using Wide V
 ector Instructions
DESCRIPTION:Workshop\nArchitectures, Data Analytics, Graph Algorithms, Wor
 kshop Reg Pass\n\nA Fast and Simple Approach to Merge and Merge Sorting Us
 ing Wide Vector Instructions\n\nWatkins, Green\n\nMerging and sorting algo
 rithms are the backbone of many modern computer applications. As such, eff
 icient implementations are desired. Recent architectural advancements in C
 PUs (Central Processing Units), such as wider and more powerful vector ins
 tructions, allow for algorithmic improvements. This paper presents a new a
 pproach to merge sort using vector instructions. Traditional approaches to
  vectorized sorting typically utilize a bitonic sorting network (Batcher's
  Algorithm) which adds significant overhead. Our approach eliminates the o
 verhead from this approach. We start with a branch-avoiding merge algorith
 m and then use the Merge Path algorithm to split up merging between the di
 fferent SIMD lanes. Testing demonstrates that the algorithm not only surpa
 sses the SIMD based bitonic counterpart, but that it is over 2.94X faster 
 than a traditional merge, merging over 300M keys per second in one thread 
 and over 16B keys per second in parallel.  Our new sort reaches is over 5X
  faster than quicksort and 2X faster than Intel's IPP library sort, sortin
 g over 5.3M keys per second for a single processor and in parallel over 50
 0M keys per second and a speedup of over 2X from a traditional merge sort.
URL:https://sc18.supercomputing.org/presentation/?id=ws_ia101&sess=sess168
END:VEVENT
END:VCALENDAR