Executable books make campaign data tangible

Presentation slides: https://eurec4a.pages.gwdg.de/dach2022_howto_eurec4a/

Theresa Mieslinger, Jule Radtke and Tobias Kölling

Max Planck Institut für Meteorologie & Universität Hamburg

Wealth of data …

… collected in large field campaigns like EUREC4A (Stevens et al. 2021).

Wealth of data …

… and distributed among many research institutions.

How to … make data usable?

How to EUREC4A

Shareable analysis scripts in the form of an executable online book …

… and in doing so, it becomes the hub of a larger ecosystem.

The technical ecosystem

Jupyter Book - an online executable book

  • book-like structure
  • Markdown for narrative content, e.g. instrument details
  • MyST for executable content, e.g. code examples

Taking the pain out of data access and distribution

  • cataloging system: for listing datasets
  • drivers: opening instructions
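The split between a catalog and its drivers can be illustrated with a toy example. This is only a sketch of the idea, not the intake API: the names `CATALOG`, `DRIVERS` and `open_dataset` are hypothetical, and in-memory strings stand in for remote files.

```python
import csv
import io
import json


# Drivers: each one knows how to open a single storage format.
def open_csv(text):
    return list(csv.DictReader(io.StringIO(text)))


def open_json(text):
    return json.loads(text)


DRIVERS = {"csv": open_csv, "json": open_json}

# Catalog: lists the datasets and records which driver opens each one.
# The "source" strings are made-up placeholders for real file locations.
CATALOG = {
    "sondes": {"driver": "csv", "source": "alt,temp\n0,300.1\n100,299.2\n"},
    "flights": {"driver": "json", "source": '{"segments": ["circle", "straight_leg"]}'},
}


def open_dataset(name):
    """Look up a dataset by name and open it with the driver the catalog names."""
    entry = CATALOG[name]
    return DRIVERS[entry["driver"]](entry["source"])


sondes = open_dataset("sondes")
flights = open_dataset("flights")
```

The user only needs the dataset name; where the data lives and how it is opened stay inside the catalog entry.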

The EUREC4A intake catalog

EUREC4A intake catalog usage

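import eurec4a

# Open the EUREC4A intake catalog
cat = eurec4a.get_intake_catalog()
# Lazily open the JOANNE level-3 dropsonde dataset
ds = cat.dropsondes.JOANNE.level3.to_dask()
# Scatter plot of flight longitude against flight latitude
ds.plot.scatter(x="flight_lon", y="flight_lat");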

Accessing remote data: OPeNDAP

OPeNDAP provides software which makes local data accessible to remote locations regardless of local storage format.

  • advantage: powerful server, allows access to data chunks
  • disadvantage: error-prone
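Access to data chunks can be sketched as a request URL that names only the wanted part of a variable. The helper below builds such a URL using the DAP2 hyperslab notation `[start:stride:stop]`. The function name, server address and variable name are hypothetical and only serve as illustration; no request is sent.

```python
def dap2_subset_url(base_url, variable, slices, response="ascii"):
    """Build a DAP2 request URL for a subset (hyperslab) of one variable.

    slices holds one (start, stride, stop) tuple per dimension;
    in DAP2 the stop index is inclusive.
    """
    hyperslab = "".join(
        f"[{start}:{stride}:{stop}]" for start, stride, stop in slices
    )
    return f"{base_url}.{response}?{variable}{hyperslab}"


# Hypothetical server and variable: first 10 entries along the first
# dimension, every 10th entry up to index 990 along the second.
url = dap2_subset_url(
    "https://example.org/opendap/sondes.nc",
    "ta",
    [(0, 1, 9), (0, 10, 990)],
)
```

In practice, client libraries usually build such requests themselves, so the user only indexes into a remote dataset.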

Accessing remote data: the Zarr library

Zarr is a format for the storage of chunked, compressed, N-dimensional arrays.

  • data in form of blocks or chunks
  • each chunk is stored as a separate file or object, which allows parallel processing
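The idea of chunked, compressed storage can be sketched with the Python standard library alone. This is not the Zarr API: the helpers `write_chunks`, `read_chunk` and `chunk_sum` are hypothetical, and a dictionary stands in for a file system or object store in which every chunk is a separate entry.

```python
import struct
import zlib
from concurrent.futures import ThreadPoolExecutor


def write_chunks(values, chunk_size):
    """Split a 1-D sequence of floats into compressed chunks keyed by chunk index."""
    store = {}
    for i in range(0, len(values), chunk_size):
        block = values[i:i + chunk_size]
        raw = struct.pack(f"<{len(block)}d", *block)
        store[str(i // chunk_size)] = zlib.compress(raw)
    return store


def read_chunk(store, key):
    """Decompress one chunk and unpack it back into a list of floats."""
    raw = zlib.decompress(store[key])
    return list(struct.unpack(f"<{len(raw) // 8}d", raw))


def chunk_sum(store, key):
    return sum(read_chunk(store, key))


values = [float(v) for v in range(10)]
store = write_chunks(values, chunk_size=4)
keys = sorted(store, key=int)

# Each chunk is read and reduced independently, so the work can run in parallel.
with ThreadPoolExecutor() as pool:
    partial_sums = list(pool.map(lambda key: chunk_sum(store, key), keys))

total = sum(partial_sums)
```

Because a reader only touches the chunks it needs, a small selection from a large remote array does not require downloading the whole array.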

Accessing remote data: IPFS - the InterPlanetary File System

IPFS is a distributed system for storing and accessing files, websites, applications, and data.

  • datasets can be pinned on several distributed servers
  • IPFS searches for your requested dataset and delivers it from any nearby server that is online
  • data is identified by its content (CID)
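Content addressing can be sketched in a few lines. This is a simplified illustration, not the IPFS protocol: a plain SHA-256 hex digest stands in for a real CID (which uses a different encoding), dictionaries stand in for servers, and the names `content_id`, `pin` and `fetch` are hypothetical.

```python
import hashlib


def content_id(data: bytes) -> str:
    """Derive the identifier from the content itself (simplified stand-in for a CID)."""
    return hashlib.sha256(data).hexdigest()


def pin(server: dict, data: bytes) -> str:
    """Store data on a server under its content identifier."""
    cid = content_id(data)
    server[cid] = data
    return cid


def fetch(servers, cid):
    """Return the data from the first server that holds it, verifying it by re-hashing."""
    for server in servers:
        data = server.get(cid)
        if data is not None and content_id(data) == cid:
            return data
    raise KeyError(cid)


server_a, server_b, server_c = {}, {}, {}
dataset = b"dropsonde profile data"

# The same dataset is pinned on two of the three servers.
cid = pin(server_a, dataset)
pin(server_c, dataset)

# The request names the content, not a location; any server holding it can answer.
result = fetch([server_b, server_a, server_c], cid)
```

Since the identifier is computed from the data, a client can check that what it received is exactly what it asked for, regardless of which server delivered it.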

Putting the puzzle back together

The social ecosystem

More examples…

Highlights

  • goes far beyond FAIR data: fosters analysis-ready cloud-optimized data (Abernathy et al., 2021)
  • data is used efficiently - duplicates are avoided
  • inclusion
  • collaboration and knowledge transfer
  • scalable to meet future data needs

Conclusions

  • an openly accessible and executable online book makes campaign data tangible
  • explanations of the available instruments, data and typical usage patterns help users get started with the data
  • as the hub of a larger ecosystem, the book makes data visible, easily accessible, shareable and usable
  • the book thrives on, enables and stimulates collaboration and inclusion
  • positive feedback loops stimulate the publication of accessible and understandable datasets in an analysis-friendly way