BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:America/Chicago
X-LIC-LOCATION:America/Chicago
BEGIN:DAYLIGHT
TZOFFSETFROM:-0600
TZOFFSETTO:-0500
TZNAME:CDT
DTSTART:19700308T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0500
TZOFFSETTO:-0600
TZNAME:CST
DTSTART:19701101T020000
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20181221T160904Z
LOCATION:C2/3/4 Ballroom
DTSTART;TZID=America/Chicago:20181113T083000
DTEND;TZID=America/Chicago:20181113T170000
UID:submissions.supercomputing.org_SC18_sess325_spost135@linklings.com
SUMMARY:Measuring Swampiness: Quantifying Chaos in Large Heterogeneous Dat
 a Repositories
DESCRIPTION:ACM Student Research Competition\, Poster\nTech Program Reg Pa
 ss\, Exhibits Reg Pass\n\nMeasuring Swampiness: Quantifying Chaos in La
 rge Heterogeneous Data Repositories\n\nJung\, Whitaker\n\nAs scientif
 ic data repositories and filesystems grow in size and complexity\, the
 y become increasingly disorganized. The coupling of massive quantiti
 es of data with poor organization makes it challenging for scientis
 ts to locate and utilize relevant data\, thus slowing the process of an
 alyzing data of interest. To address these issues\, we explore an aut
 omated clustering approach for quantifying the organization of data r
 epositories. Our parallel pipeline processes heterogeneous filetyp
 es (e.g.\, text and tabular data)\, automatically clusters files bas
 ed on content and metadata similarities\, and computes a novel "clean
 liness" score from the resulting clustering. We demonstrate the gene
 ration and accuracy of our cleanliness measure using both synthetic a
 nd real datasets\, and conclude that it is more consistent than othe
 r potential cleanliness measures.
URL:https://sc18.supercomputing.org/presentation/?id=spost135&sess=sess325
END:VEVENT
END:VCALENDAR