11/12/2018 Colloquium Series - Raquel Hill: “Combating Workflow Failures with Integrity Based Checkpoints and Blockchain”
Workflow management systems are subject to failures, including processor failures, network congestion, and machine reboots. Various fault tolerance techniques have been proposed to address these failures. Data integrity errors also cause workflows to fail, but little or no attention has been given to integrity faults. The Scientific Workflow Integrity with Pegasus (SWIP) project has shown that data integrity errors do occur in the wild. These errors occur when transferring and storing experiment data. The inability of today’s validation mechanisms, such as TCP checksums and Layer 2 checksums, to detect these errors motivated the SWIP project to add an extra layer of application-layer data integrity verification using cryptographic hashes. Currently, the SWIP project takes a checkpoint-all approach to integrity data, moving all integrity data for a task to stable storage. During this presentation, I will discuss our new approach, whereby we characterize nodes in workflow graphs based on the graph structure and propose several integrity-based checkpointing strategies. These strategies use a node’s properties to determine which nodes to checkpoint. When failures occur, the proposed integrity-based checkpointing strategies allow us to validate the integrity of the data from prior workflow tasks and re-use that data during workflow retries. To facilitate the validation process, we implement a blockchain prototype and evaluate the overhead of securely storing the integrity meta-data on a public blockchain.
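As a rough illustration of the application-layer verification the abstract describes, the sketch below records a cryptographic hash when a task's output is written and re-checks it before that output is re-used on a retry. It is not the SWIP or Pegasus implementation: the function names (`sha256_of_file`, `record_checkpoint`, `can_reuse`), the choice of SHA-256, and the dictionary layout of a checkpoint are all assumptions made for this example.

```python
import hashlib
import os


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA-256 digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes (EOF).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_checkpoint(path):
    """Hypothetical integrity record for one task output (assumed layout)."""
    return {"path": path, "sha256": sha256_of_file(path)}


def can_reuse(checkpoint):
    """True only if the file still exists and its hash matches the record."""
    path = checkpoint["path"]
    return os.path.exists(path) and sha256_of_file(path) == checkpoint["sha256"]
```

On a workflow retry, a scheduler built along these lines would call `can_reuse` on each recorded output and re-run only the tasks whose data is missing or fails the hash comparison. Where the records themselves are kept (stable storage, or a blockchain as in the talk) is outside the scope of this sketch.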