4M.4.FULLSTACKS: Cross-stack Compute

Quickstart Tutorial
Introduction
Use of Globus Auth Token
Globus Genomics WES Interface
Analysis of 5 Downsampled CRAM inputs

Quickstart Tutorial

This quickstart tutorial walks through a quick submission of 5 downsampled TOPMed CRAM input files using a TOPMed Alignment workflow in CWL. It uses a portal to index and search the input datasets and submits to a WES (Workflow Execution Service - GA4GH) service deployed as a shim-layer on the Galaxy based Globus Genomics platform.

Login to the search portal at: https://globus-portal.fair-research.org/search/ using your Globus credentials
Search for the 5 downsampled input CRAM files using the search tag “downsampled”
Select the 5 samples by checking the box next to “downsampled” in the left menu
Click on “Add Minids” button, which creates a Workspace called “Downsampled Topmed” and adds these 5 samples for analysis
Then click on the “Start” button for each input CRAM file to initiate the alignment workflow using Globus Genomics backend.
Typically, after 20-25mins, the analysis of the 5 downsampled inputs should be completed and you should see the resulting BDBag-Minid under the “Output files” column of the workspace. The picture below shows a screenshot of the workspace used for the analysis of the 5 input CRAM files

Screenshot

Introduction

This README describes the implementation of a fullstacks platform that allows to:

Analyze open access TOPMed WGS data by sharing the same CWL workflow around with other fullstacks
Scale up the analysis to 5 downsampled, 7 full size WGS open access TOPMed samples and 25 of the remaining 100 samples
Demonstrate the use of GA4GH Workflow Execution Service (WES) implementation to standardize the workflow execution across multiple fullstacks
Build upon the “Workspaces” implemented in 3M.4 Fullstacks demo and add additional features to submit to the WES interface
Use of Globus auth tokens for user access and user management in Galaxy within Globus Genomics

Some of the highlights of this month’s deliverable are:

We indexed the 5 downsampled and 107 open access TOPMed samples within the data portal at https://globus-portal.fair-research.org/workflows/
We implemented a CWL-Runner tool within Galaxy to support CWL workflow execution within Galaxy based Globus Genomics
We implemented a GA4GH WES service that provides a standard interface to allow CWL based workflow submission and workflow status tracking hiding the Galaxy specific details
A major feature is the use of Globus auth tokens for user-management on the Galaxy side, thus eliminating the need for a Galaxy API keys used in the previous month deliverables
Extended the fair research data portal to act as an workflow orchestrator to submit CWL workflows to Globus Genomics via the new WES interface

Use of Globus Auth Token:

One of the highlights of this deliverable is the use of Globus Auth tokens instead of the Galaxy API Keys to interact with Galaxy. Within the WES implementation, the Globus auth token is used to map the user to the local Galaxy user. If the user doesn’t exist, from within the WES, we create that user using the Galaxy Bioblend API and generate a Galaxy API Key that is then used internally. If the user already exists, we map the Globus Auth token to the user and retrieve the API key and use it to interact with Galaxy. It significantly simplifies the authentication, authorization and the ease of use of our fullstacks platform.

We demonstrate this feature by using the data portal that uses Globus authentication to login. And the portal submits the CWL workflows to the WES interface with the Globus auth tokens in the headers that have the Globus Genomics application scope for further validation.

Globus Genomics WES Interface

GA4GH specifications for the Workflow Execution Service is available as Swagger UI at: http://ga4gh.github.io/workflow-execution-service-schemas/ The Globus Genomics WES service is implemented to the above specification and the available at: https://nihcommons.globusgenomics.org/wes/service-info

The resources implemented in this WES are:

GET: /service-info
GET: /workflows
POST: /workflows
GET: /workflows/
DELETE: /workflows/
GET: /workflows//status

Detailed descriptions and usage of each resource is available at: http://ga4gh.github.io/workflow-execution-service-schemas/

Analysis of 5 Downsampled CRAM inputs

Using Data Portal

The FAIR Research data portal is available at: https://globus-portal.fair-research.org and users can login using the Login link in the top-right corner. Screenshot

When asked for the consent, please allow the portal to access the information and services listed. You will notice that you are also allowing access to the Globus Genomics service. The scope to use Globus Genomics is added to the Auth token generated by Globus, which will be presented by the portal to the WES service within Globus Genomics. Screenshot

Search

The downsampled CRAM files have an annotation of “downsampled” within the data portal. Use the search term “downsampled” in the search box at: https://globus-portal.fair-research.org/search/

Select the checkbox next to “downsampled” in the left hand menu as shown in the screenshot below.

Click on the “Add Minids” button to add the 5 samples to a “Workspace” collection called “Downsampled Topmed” as shown below.

Submit

From within the Workspace, click on each “Start” button to initiate a submission to the WES service.

Screenshot

Status

These downsampled CRAMs, typically take about 20-25mins to complete the analysis. Once the analysis is complete, you will notice the status in Workspace changes to “Complete” and you should also see the MINID for the output BDBag generated. The screenshot below shows a completed analysis:

Screenshot

Using CURL from command line

Get Globus Token

The WES API exists as a public Resource Server secured with Globus Auth, so you may request a token and submit samples using any registered Globus app.

Starting from scratch requires three steps:

Register a Globus App at https://developers.globus.org
Configure your app to request the NIH Commons scope
- Example here: https://github.com/fair-research/fair-genomics/blob/master/4M.4.Fullstacks/sample_app.py
Login to receive your token

JSON Payloads

For each of the 5 downsampled CRAM inputs, the three reference files (reference_genome, bwa_index and dbsnp) will be same. Please note the minids for these 3 reference files within the JSON payload example below

The input minids for the 5 downsampled CRAMs are:

Downsample CRAM/CRAI ID Number: NWD176325, NA19238 – ark:/99999/fk4U4TyRAKafWMB
Downsample CRAM/CRAI ID Number: NWD136397, HG01110 – ark:/99999/fk456x1jMoFxfKB
Downsample CRAM/CRAI ID Number: NWD119836, NA12878 – ark:/99999/fk4cAzlMXIUOfes
Downsample CRAM/CRAI ID Number: NWD315403, HG01249 – ark:/99999/fk41FSiqz9iY58R1
Downsample CRAM/CRAI ID Number: NWD231092, HG01111 – ark:/99999/fk4jVBacAVBkFsL

Below is an example of the JSON payload used to do a POST request to the WES at https://nihcommons.globusgenomics.org/wes/workflows

{
  "workflow_params": {
    "reference_genome": {
      "class": "File",
      "path": "ark:/99999/fk4aZVT0ZWH8Ip0"
    },
    "bwa_index": {
      "class": "File",
      "path": "ark:/99999/fk4erydOcxk7PA2"
    },
    "dbsnp": {
      "class": "File",
      "path": "ark:/99999/fk4zKBK8XkAnaXQ"
    },
    "input_file": {
      "class": "File",
      "path": "ark:/99999/fk4U4TyRAKafWMB"
    }
  },
  "workflow_descriptor": "TOPMed Alignment Workflow",
  "workflow_url": "https://raw.githubusercontent.com/DataBiosphere/topmed-workflows/master/aligner/sbg-alignment-cwl/topmed-alignment.cwl",
  "workflow_type_version": "v1.0",
  "workflow_type": "CWL"
}

CURL Commands

Here are the 5 command-line curl statements for submitting the 5 Downsampled CRAM input files. Replace the <TOKEN> with the token generated in the previous section:

Downsample CRAM/CRAI ID Number: NWD176325, NA19238:

curl -H "Accept: application/json" -H "Content-Type: application/json" -X POST -H "Authorization: <TOKEN>" '{"workflow_params": {"reference_genome": {"class": "File", "path": "ark:/99999/fk4aZVT0ZWH8Ip0"}, "bwa_index": {"class": "File", "path": "ark:/99999/fk4erydOcxk7PA2"}, "dbsnp": {"class": "File", "path": "ark:/99999/fk4zKBK8XkAnaXQ"}, "input_file": {"class": "File", "path": "ark:/99999/fk4U4TyRAKafWMB"}}, "workflow_descriptor": "TOPMed Alignment Workflow", "workflow_url": "https://raw.githubusercontent.com/DataBiosphere/topmed-workflows/master/aligner/sbg-alignment-cwl/topmed-alignment.cwl", "workflow_type_version": "v1.0", "workflow_type": "CWL"}' https://nihcommons.globusgenomics.org/wes/workflows

Downsample CRAM/CRAI ID Number: NWD315403, HG01249:

curl -H "Accept: application/json" -H "Content-Type: application/json" -X POST -H "Authorization: <TOKEN>" '{"workflow_params": {"reference_genome": {"class": "File", "path": "ark:/99999/fk4aZVT0ZWH8Ip0"}, "bwa_index": {"class": "File", "path": "ark:/99999/fk4erydOcxk7PA2"}, "dbsnp": {"class": "File", "path": "ark:/99999/fk4zKBK8XkAnaXQ"}, "input_file": {"class": "File", "path": "ark:/99999/fk41FSiqz9iY58R1"}}, "workflow_descriptor": "TOPMed Alignment Workflow", "workflow_url": "https://raw.githubusercontent.com/DataBiosphere/topmed-workflows/master/aligner/sbg-alignment-cwl/topmed-alignment.cwl", "workflow_type_version": "v1.0", "workflow_type": "CWL"}' https://nihcommons.globusgenomics.org/wes/workflows

Downsample CRAM/CRAI ID Number: NWD136397, HG01110:

curl -H "Accept: application/json" -H "Content-Type: application/json" -X POST -H "Authorization: <TOKEN>" '{"workflow_params": {"reference_genome": {"class": "File", "path": "ark:/99999/fk4aZVT0ZWH8Ip0"}, "bwa_index": {"class": "File", "path": "ark:/99999/fk4erydOcxk7PA2"}, "dbsnp": {"class": "File", "path": "ark:/99999/fk4zKBK8XkAnaXQ"}, "input_file": {"class": "File", "path": "ark:/99999/fk456x1jMoFxfKB"}}, "workflow_descriptor": "TOPMed Alignment Workflow", "workflow_url": "https://raw.githubusercontent.com/DataBiosphere/topmed-workflows/master/aligner/sbg-alignment-cwl/topmed-alignment.cwl", "workflow_type_version": "v1.0", "workflow_type": "CWL"}' https://nihcommons.globusgenomics.org/wes/workflows

Downsample CRAM/CRAI ID Number: NWD231092, HG01111:

curl -H "Accept: application/json" -H "Content-Type: application/json" -X POST -H "Authorization: <TOKEN>" '{"workflow_params": {"reference_genome": {"class": "File", "path": "ark:/99999/fk4aZVT0ZWH8Ip0"}, "bwa_index": {"class": "File", "path": "ark:/99999/fk4erydOcxk7PA2"}, "dbsnp": {"class": "File", "path": "ark:/99999/fk4zKBK8XkAnaXQ"}, "input_file": {"class": "File", "path": "ark:/99999/fk4jVBacAVBkFsL"}}, "workflow_descriptor": "TOPMed Alignment Workflow", "workflow_url": "https://raw.githubusercontent.com/DataBiosphere/topmed-workflows/master/aligner/sbg-alignment-cwl/topmed-alignment.cwl", "workflow_type_version": "v1.0", "workflow_type": "CWL"}' https://nihcommons.globusgenomics.org/wes/workflows

Downsample CRAM/CRAI ID Number: NWD119836, NA12878:

curl -H "Accept: application/json" -H "Content-Type: application/json" -X POST -H "Authorization: <TOKEN>" '{"workflow_params": {"reference_genome": {"class": "File", "path": "ark:/99999/fk4aZVT0ZWH8Ip0"}, "bwa_index": {"class": "File", "path": "ark:/99999/fk4erydOcxk7PA2"}, "dbsnp": {"class": "File", "path": "ark:/99999/fk4zKBK8XkAnaXQ"}, "input_file": {"class": "File", "path": "ark:/99999/fk4cAzlMXIUOfes"}}, "workflow_descriptor": "TOPMed Alignment Workflow", "workflow_url": "https://raw.githubusercontent.com/DataBiosphere/topmed-workflows/master/aligner/sbg-alignment-cwl/topmed-alignment.cwl", "workflow_type_version": "v1.0", "workflow_type": "CWL"}' https://nihcommons.globusgenomics.org/wes/workflows

These curl commands return a tracking ID (workflow-id) that can be used to check the status, as shown in the next section.

Check Status

The WES resources for a detailed status, also provides the MINDs for the output BDBag once the analysis is complete. You can

curl -H "Accept: application/json" -H "Content-Type: application/json" -X GET -H "Authorization: <TOKEN>" https://nihcommons.globusgenomics.org/wes/workflows/<workflow-id>

Results

Following table provides the results of analysis with md5sum for the output, along with time taken and cost of analysis. Please note that the md5sum are calculated after removing the headers from the output CRAM file, so that the md5sum can be compared with the results from other fullstacks.

Downsample Inputs	md5sum	Runtime	Cost($)
NWD119836	105bf65c2e4ea23f7a110bee17c1a074	19mins	0.036
NWD136397	c8bab3ba0f90406a035cabb243716356	19mins	0.036
NWD176325	186d2cdf1efdc2746e6d3b26cd887c0a	19mins	0.036
NWD231092	4ac1e5edc1fd9d0644d2c0082ac02392	19mins	0.036
NWD315403	efda0cdef1e172f495052a62a93d799c	19mins	0.036

back

This work was supported in part by NIH grant 1U54EB020406-01.

FAIR Research

Best practices, tools and tips for integrating FAIR data principles into your daily work.

4M.4.FULLSTACKS: Cross-stack Compute

Table of Contents

Quickstart Tutorial

Introduction

Use of Globus Auth Token:

Globus Genomics WES Interface

Analysis of 5 Downsampled CRAM inputs

Using Data Portal

Search

Submit

Status

Using CURL from command line

Get Globus Token

JSON Payloads

CURL Commands

Check Status

Results

4M.4.FULLSTACKS: Cross-stack Compute

Table of Contents

Quickstart Tutorial

Introduction

Use of Globus Auth Token:

Globus Genomics WES Interface

Analysis of 5 Downsampled CRAM inputs

Using Data Portal

User Login to FAIR Research Data Portal

Search

Submit

Status

Using CURL from command line

Get Globus Token

JSON Payloads

CURL Commands

Check Status

Results