See the technical program schedule for tutorial location and time.
Data Analysis and Sharing with the ENES Climate Analytics Service (ECAS)
Dahou Sofiane Bendoukha and Tobias Weigel, German Climate Computing Center
The ENES Climate Analytics Service (ECAS) is a new service from the EOSC-hub project. It enables scientific end users to perform data analysis experiments on large volumes of climate data by exploiting a PID-enabled, server-side, and parallel approach. It aims to provide a paradigm shift for the ENES community, with a strong focus on data-intensive analysis, provenance management, and server-side approaches, as opposed to current approaches, which are mostly client-based and sequential, with limited or missing end-to-end analytics workflow/provenance capabilities. Furthermore, the integrated data analytics service enables basic data provenance tracking by establishing PID support through the whole chain, thereby improving reusability, traceability, and reproducibility.
The objective of the tutorial is to present ECAS and its processing and data management capabilities to potential future users. Attendees will learn about the ECAS software stack (Jupyter, Ophidia, and others) and how to use the different integrated software packages. Besides the processing capabilities, the tutorial also covers data/workflow sharing with other researchers or with broader community experts, enabled through integrated cloud-based services such as B2DROP and B2SHARE.
The tutorial will be divided into a teaching part and a practical hands-on training part and includes:
- presentation(s) on the theoretical and technical background of ECAS. This covers the data cube concept and its operations (e.g., subset extraction, reduction, aggregation). Furthermore, we provide an introduction to the Ophidia framework, the component of ECAS for processing multidimensional data
- tutorials and training materials with hands-on Jupyter notebooks. Participants will have the opportunity to dive into the ECAS software stack and learn how to manipulate multidimensional data through real-world use cases from the climate domain.
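To make the data cube operations listed above concrete, the following plain-Python sketch illustrates subset extraction, reduction, and aggregation on a tiny cube. This is not the Ophidia/ECAS API; the cube contents, function names, and dimension layout are all hypothetical, chosen only to mirror the concepts.

```python
# Illustrative sketch of data cube operations (subset extraction,
# reduction, aggregation) in plain Python. NOT the Ophidia/ECAS API;
# all names and data here are hypothetical.

# A tiny "cube": values indexed by (time, lat, lon).
cube = {
    (t, lat, lon): 10.0 + t + 0.1 * lat - 0.1 * lon
    for t in range(4)        # 4 time steps
    for lat in range(3)      # 3 latitude points
    for lon in range(2)      # 2 longitude points
}

def subset(cube, times):
    """Subset extraction: keep only the selected time steps."""
    return {k: v for k, v in cube.items() if k[0] in times}

def reduce_time(cube):
    """Reduction: average over the time dimension, per (lat, lon) cell."""
    sums, counts = {}, {}
    for (t, lat, lon), v in cube.items():
        sums[(lat, lon)] = sums.get((lat, lon), 0.0) + v
        counts[(lat, lon)] = counts.get((lat, lon), 0) + 1
    return {k: sums[k] / counts[k] for k in sums}

def aggregate(grid):
    """Aggregation: collapse the remaining grid to a single mean value."""
    return sum(grid.values()) / len(grid)

recent = subset(cube, times={2, 3})   # keep the last two time steps
mean_map = reduce_time(recent)        # 2-D field of per-cell time means
global_mean = aggregate(mean_map)     # single scalar summary
```

In ECAS, the same chain of operations runs server-side and in parallel over much larger cubes, but the conceptual flow (select, reduce along a dimension, aggregate the rest) is the same.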
Creating Reproducible Experimentation Workflows with Popper: A Hands-on, Bring Your Own Code Tutorial
Ivo Jimenez and Carlos Maltzahn, University of California, Santa Cruz
Currently, approaches to scientific research require activities that take up much time but do not actually advance our scientific understanding. For example, researchers and their students spend countless hours reformatting data and writing code to attempt to reproduce previously published research. What if the scientific community could find a better way to create and publish their workflows, data, and models to minimize the amount of the time spent “reinventing the wheel”? Popper is an experimentation protocol and CLI tool for implementing scientific exploration pipelines following a DevOps approach that allows researchers to generate work that is easy to reproduce and extend.
Modern open source software development communities have created tools that make it easier to manage large codebases, allowing them to deal with high levels of complexity, not only in terms of managing code changes, but with the entire ecosystem that is needed in order to deliver changes to software in an agile, rapidly changing environment. These practices and tools are collectively referred to as DevOps. The Popper experimentation protocol repurposes the DevOps practice in the context of scientific explorations so that researchers can leverage existing tools and technologies to maintain and publish scientific analyses that are easy to reproduce.
In the first part of this tutorial, we will briefly introduce DevOps and give an overview of best practices. We will then show how these practices can be repurposed for carrying out scientific explorations and illustrate using some examples. The second part of the course will be devoted to hands-on experiences with the goal of walking the audience through the usage of the Popper CLI tool.
Connect your Research Data with Collaborators and Beyond
Amit Chourasia and David Nadeau, San Diego Supercomputer Center
Data is an integral part of scientific research. With rapid growth in data collection and generation capabilities and the increasingly collaborative nature of research activities, data management and data sharing have become central to accomplishing research goals. Researchers today have a variety of solutions at their disposal, from local storage to cloud-based storage. However, most of these solutions focus on hierarchical file and folder organization. While such an organization is pervasively used and quite useful, it relegates contextual information about the data, such as descriptions and collaborative notes, to external systems. This spread of information into different silos impedes the flow of research activities.
In this tutorial, we will introduce and provide hands-on experience with the SeedMe platform, which provides a web-based data management and data sharing cyberinfrastructure. SeedMe enables research groups to manage, share, search, visualize, and present their data in a web-based environment using an access-controlled, branded, and customizable website they own and control. It supports storing and viewing data in a familiar tree hierarchy, but also supports formatted annotations, lightweight visualizations, and threaded comments on any file/folder. The system can be easily extended and customized to support metadata, job parameters, and other domain- and project-specific contextual items. The software is open source and available as an extension to the popular Drupal content management system.
Pegasus Scientific Workflows with Containers
Karan Vahi and Mats Rynge, USC Information Sciences Institute
Workflows are a key technology for enabling complex scientific computations. They capture the interdependencies between processing steps in data analysis and simulation pipelines as well as the mechanisms to execute those steps reliably and efficiently. Workflows can capture complex processes to promote sharing and reuse, and also provide provenance information necessary for the verification of scientific results and scientific reproducibility. Application containers such as Docker and Singularity are increasingly becoming a preferred way for bundling user application code with complex dependencies, to be used during workflow execution.
Pegasus is being used in a number of scientific domains doing production-grade science. In 2016 the LIGO gravitational-wave experiment used Pegasus to analyze instrumental data and confirm the first detection of a gravitational wave. The Southern California Earthquake Center (SCEC), based at USC, uses a Pegasus-managed workflow infrastructure called CyberShake to generate hazard maps for the Southern California region. In March 2017, SCEC conducted a CyberShake study on ORNL's Titan and NCSA's Blue Waters systems. Overall, the study required 450,000 node-hours of computation across the two systems. Pegasus is also being used in astronomy, bioinformatics, civil engineering, climate modeling, earthquake science, molecular dynamics, and other complex analyses.
The goal of the tutorial is to introduce the benefits of modeling pipelines in a portable way using scientific workflows with application containers. We will examine the workflow lifecycle at a high level, along with the issues and challenges associated with its various steps, such as creation, execution, monitoring, and debugging. Through hands-on exercises, we will model an application pipeline, bundle the application codes in containers, and execute the pipeline on distributed computing infrastructures. Attendees will leave the tutorial with knowledge of how to implement their own computations using containers and workflows.
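The workflow concept described above, a set of processing steps with interdependencies, each paired with a container image, can be sketched as a directed acyclic graph executed in dependency order. The sketch below is conceptual plain Python, not the Pegasus API; the step names and container images are hypothetical.

```python
# Conceptual sketch (NOT the Pegasus API): a workflow as a DAG of
# steps executed in dependency order. Step names and container
# images are hypothetical illustrations of how a workflow system
# pairs each task with the container it should run in.

from collections import deque

# Each step maps to (its dependencies, the container image it uses).
pipeline = {
    "preprocess": ([], "docker://alpine:3"),
    "simulate":   (["preprocess"], "docker://science/sim:1.0"),
    "analyze":    (["simulate"], "docker://science/stats:2.1"),
    "plot":       (["analyze", "preprocess"], "docker://science/viz:0.9"),
}

def topological_order(steps):
    """Return an execution order that respects all dependencies."""
    indegree = {s: len(deps) for s, (deps, _) in steps.items()}
    ready = deque(s for s, d in indegree.items() if d == 0)
    order = []
    while ready:
        s = ready.popleft()
        order.append(s)
        for t, (deps, _) in steps.items():
            if s in deps:
                indegree[t] -= 1
                if indegree[t] == 0:
                    ready.append(t)
    if len(order) != len(steps):
        raise ValueError("cycle detected in workflow")
    return order

order = topological_order(pipeline)
# Each step now runs only after all of its dependencies have finished,
# inside its own container image.
```

A workflow manager such as Pegasus adds, on top of this basic ordering, the reliability machinery the abstract mentions: data staging, retries, monitoring, and provenance tracking across distributed resources.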