BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:America/Chicago
X-LIC-LOCATION:America/Chicago
BEGIN:DAYLIGHT
TZOFFSETFROM:-0600
TZOFFSETTO:-0500
TZNAME:CDT
DTSTART:19700308T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0500
TZOFFSETTO:-0600
TZNAME:CST
DTSTART:19701101T020000
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20181221T160726Z
LOCATION:D173
DTSTART;TZID=America/Chicago:20181111T115500
DTEND;TZID=America/Chicago:20181111T122000
UID:submissions.supercomputing.org_SC18_sess163_ws_works111@linklings.com
SUMMARY:A Practical Roadmap for Provenance Capture and Data Analysis in Sp
 ark-Based Scientific Workflows
DESCRIPTION:Workshop\nReproducibility, Scientific Computing, Scientific Wo
 rkflows, Workflows, Workshop Reg Pass, HPC, Data Intensive\n\nA Practical 
 Roadmap for Provenance Capture and Data Analysis in Spark-Based Scientific
  Workflows\n\nGuedes, Silva, Mattoso, Bedo, Oliveira\n\nWhenever high-perf
 ormance computing applications meet data-intensive scalable systems, an at
 tractive approach is the use of Apache Spark for the management of scienti
 fic workflows. Spark provides several advantages such as being widely supp
 orted and granting efficient in-memory data management for large-scale app
 lications. However, Spark still lacks support for data tracking and workfl
 ow provenance. Additionally, Spark’s memory management requires accessing 
 all data movements between the workflow activities. Therefore, the running
  of legacy programs on Spark is interpreted as a “black-box” activity, whi
 ch prevents the capture and analysis of implicit data movements. Here, we 
 present SAMbA, an Apache Spark extension for the gathering of prospective 
 and retrospective provenance and domain data within distributed scientific
  workflows. Our approach relies on enveloping both RDD structure and data 
 contents at runtime so that (i) RDD-enclosure consumed and produced data a
 re captured and registered by SAMbA in a structured way, and (ii) provenan
 ce data can be queried during and after the execution of scientific workfl
 ows. By following the W3C PROV representation, we model the roles of RDD r
 egarding prospective and retrospective provenance data. Our solution provi
 des mechanisms for the capture and storage of provenance data without jeop
 ardizing Spark’s performance. The provenance retrieval capabilities of our
  proposal are evaluated in a practical case study, in which data analytics
  are provided by several SAMbA parameterizations.
URL:https://sc18.supercomputing.org/presentation/?id=ws_works111&sess=sess
 163
END:VEVENT
END:VCALENDAR

