BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:America/Chicago
X-LIC-LOCATION:America/Chicago
BEGIN:DAYLIGHT
TZOFFSETFROM:-0600
TZOFFSETTO:-0500
TZNAME:CDT
DTSTART:19700308T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0500
TZOFFSETTO:-0600
TZNAME:CST
DTSTART:19701101T020000
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20181221T160727Z
LOCATION:D221
DTSTART;TZID=America/Chicago:20181111T161000
DTEND;TZID=America/Chicago:20181111T163000
UID:submissions.supercomputing.org_SC18_sess162_ws_cre102@linklings.com
SUMMARY:Reproducibility for Streaming Analysis
DESCRIPTION:Workshop\nExascale, Hot Topics, Reproducibility, Scientific Co
 mputing, Workshop Reg Pass\n\nReproducibility for Streaming Analysis\n\nWr
 ight, Pouchard, Billinge\n\nThe natural and physical sciences increasingly
  need streaming data processing for live data analysis and autonomous expe
 rimentation. Furthermore, data provenance and replicability are important 
 to assure the veracity of scientific results. Here we describe a software 
 system that combines high performance computing, streaming data processing
 , and automatic data provenance capturing to address this need. Data prove
 nance and streaming data processing share a common data structure, the dir
 ected acyclic graph (DAG), which describes the order of each computational
  step. Data processing requires the DAG to specify what computations to ru
 n in what order, and the execution can be recreated from the graph, reprod
 ucing the analyzed data and capturing provenance. In our framework the des
 cription and ordering of the analysis steps (the pipeline) are separated f
 rom their execution (the streaming analysis) and the DAG created for the s
 treaming data processing is captured during data analysis. Streaming da
 ta can have high throughputs, and our system allows users to choose am
 ong multiple parallel processing backends, including Dask. To guarant
 ee reproducibility, unique links to the incoming data and their timest
 amps are captured alongside the DAG. Analyzed data, along with provena
 nce metadata, are st
 ored in a database, from which the analysis can be re-run on the raw da
 ta, enabling verification of results, exploration of how parameters ch
 ange outcomes, and reuse of data processing. This system is running i
 n production at the National Synch
 rotron Light Source-II (NSLS-II) x-ray powder diffraction beamlines.
URL:https://sc18.supercomputing.org/presentation/?id=ws_cre102&sess=sess16
 2
END:VEVENT
END:VCALENDAR

