Team Argon FAIR Research

Best practices, tools and tips for integrating FAIR data principles into your daily work.

4M.4.FULLSTACKS: Cross-stack Compute

Table of Contents

  1. Quickstart Tutorial
  2. Introduction
  3. Use of Globus Auth Token
  4. Globus Genomics WES Interface
  5. Analysis of 5 Downsampled CRAM inputs
    1. Using Data Portal
      1. User Login to FAIR Research Data Portal
      2. Search Downsampled CRAM
      3. Submit Samples
      4. Status
    2. Using CURL from command line
      1. Get Globus Token
      2. JSON Payloads for the 5 downsampled CRAM files
      3. CURL Commands for the 5 downsampled CRAM files
    3. Results

Quickstart Tutorial

This quickstart tutorial walks through the submission of 5 downsampled TOPMed CRAM input files to a TOPMed Alignment workflow written in CWL. It uses a portal to index and search the input datasets, and submits them to a GA4GH Workflow Execution Service (WES) deployed as a shim layer on the Galaxy-based Globus Genomics platform.

Screenshot

Introduction

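This README describes the implementation of a fullstacks platform that lets users index and search input datasets through a data portal and submit CWL workflows to a GA4GH WES service running on Globus Genomics.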

Some of the highlights of this month’s deliverable are:

Use of Globus Auth Token:

One of the highlights of this deliverable is the use of Globus Auth tokens, instead of Galaxy API keys, to interact with Galaxy. Within the WES implementation, the Globus Auth token is used to map the caller to a local Galaxy user. If the user does not exist, the WES creates the user through the Galaxy BioBlend API and generates a Galaxy API key, which is then used internally. If the user already exists, the WES maps the Globus Auth token to that user, retrieves the existing API key, and uses it to interact with Galaxy. This significantly simplifies authentication and authorization and makes the fullstacks platform easier to use.
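The get-or-create mapping described above can be sketched as follows. This is only an illustration of the control flow, not the WES implementation: `InMemoryGalaxy` is a hypothetical stand-in for a Galaxy server (it is not the BioBlend API), and the token is assumed to have been validated already and reduced to a dict with an `email` entry.

```python
import secrets
from dataclasses import dataclass, field


@dataclass
class InMemoryGalaxy:
    """Hypothetical stand-in for a Galaxy server; NOT the BioBlend API."""
    api_keys: dict = field(default_factory=dict)  # email -> Galaxy API key

    def find_api_key(self, email):
        # Return the stored key, or None if the user does not exist yet.
        return self.api_keys.get(email)

    def create_user_with_key(self, email):
        # Create the user and issue a fresh API key for it.
        key = secrets.token_hex(16)
        self.api_keys[email] = key
        return key


def galaxy_key_for_token(token_info, galaxy):
    """Return the Galaxy API key for the user behind a validated token.

    token_info is assumed to be a dict with an 'email' entry
    (a hypothetical shape chosen for this sketch).
    """
    email = token_info["email"]
    key = galaxy.find_api_key(email)
    if key is None:
        # First request from this identity: create the Galaxy user and key.
        key = galaxy.create_user_with_key(email)
    return key
```

Calling `galaxy_key_for_token` twice with the same identity returns the same key, so callers never need to handle Galaxy API keys themselves.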

We demonstrate this feature with the data portal, which uses Globus authentication for login. The portal submits the CWL workflows to the WES interface with Globus Auth tokens in the request headers. These tokens carry the Globus Genomics application scope for further validation.

Globus Genomics WES Interface

The GA4GH specification for the Workflow Execution Service is available as a Swagger UI at: http://ga4gh.github.io/workflow-execution-service-schemas/ The Globus Genomics WES service implements this specification and is available at: https://nihcommons.globusgenomics.org/wes/service-info

The resources implemented in this WES are:

Detailed descriptions and usage of each resource are available at: http://ga4gh.github.io/workflow-execution-service-schemas/

Analysis of 5 Downsampled CRAM inputs

Using Data Portal

User Login to FAIR Research Data Portal

The FAIR Research data portal is available at: https://globus-portal.fair-research.org and users can log in using the Login link in the top-right corner. Screenshot

When asked for consent, please allow the portal to access the information and services listed. You will notice that you are also allowing access to the Globus Genomics service. The scope to use Globus Genomics is added to the Auth token generated by Globus, which the portal presents to the WES service within Globus Genomics. Screenshot

Search Downsampled CRAM

The downsampled CRAM files are annotated as “downsampled” within the data portal. Enter the search term “downsampled” in the search box at: https://globus-portal.fair-research.org/search/

Select the checkbox next to “downsampled” in the left hand menu as shown in the screenshot below. Screenshot

Click on the “Add Minids” button to add the 5 samples to a “Workspace” collection called “Downsampled Topmed” as shown below.

Submit

From within the Workspace, click on each “Start” button to initiate a submission to the WES service.

Screenshot

Status

These downsampled CRAMs typically take about 20-25 minutes to analyze. Once the analysis is complete, the status in the Workspace changes to “Complete” and the MINID of the generated output BDBag is displayed. The screenshot below shows a completed analysis:

Screenshot

Using CURL from command line

Get Globus Token

The WES API exists as a public Resource Server secured with Globus Auth, so you may request a token and submit samples using any registered Globus app.

Starting from scratch requires three steps:

  1. Register a Globus App at https://developers.globus.org
  2. Configure your app to request the NIH Commons scope
  3. Log in to receive your token
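As a rough sketch of step 3, the snippet below builds a Globus Auth login URL for the OAuth2 authorization-code flow with PKCE, assuming the app from step 1 was registered as a native app. The client ID and scope string are placeholders: the source does not give the NIH Commons scope, so it must be looked up and substituted. The Globus Python SDK wraps this same flow and is usually the more convenient option.

```python
import base64
import hashlib
import secrets
from urllib.parse import urlencode

AUTHORIZE_URL = "https://auth.globus.org/v2/oauth2/authorize"
# Globus-hosted page that displays the authorization code for native apps.
REDIRECT_URI = "https://auth.globus.org/v2/web/auth-code"


def make_pkce_pair():
    """Return (verifier, challenge) for the PKCE S256 method."""
    verifier = secrets.token_urlsafe(64)
    digest = hashlib.sha256(verifier.encode("ascii")).digest()
    challenge = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    return verifier, challenge


def build_login_url(client_id, scope, challenge, state):
    """Build the URL the user opens in a browser to log in and consent.

    client_id and scope are placeholders supplied by the caller.
    """
    params = {
        "client_id": client_id,
        "redirect_uri": REDIRECT_URI,
        "scope": scope,
        "state": state,
        "response_type": "code",
        "code_challenge": challenge,
        "code_challenge_method": "S256",
    }
    return AUTHORIZE_URL + "?" + urlencode(params)
```

After logging in, the user receives an authorization code, which the app exchanges (together with the PKCE verifier) at the Globus Auth token endpoint for the access token used in the requests below.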

JSON Payloads

For each of the 5 downsampled CRAM inputs, the three reference files (reference_genome, bwa_index and dbsnp) are the same. Note the minids for these 3 reference files in the JSON payload example below.

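The input minids for the 5 downsampled CRAMs are:

  1. NWD176325, NA19238: ark:/99999/fk4U4TyRAKafWMB
  2. NWD315403, HG01249: ark:/99999/fk41FSiqz9iY58R1
  3. NWD136397, HG01110: ark:/99999/fk456x1jMoFxfKB
  4. NWD231092, HG01111: ark:/99999/fk4jVBacAVBkFsL
  5. NWD119836, NA12878: ark:/99999/fk4cAzlMXIUOfes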

Below is an example of the JSON payload used to do a POST request to the WES at https://nihcommons.globusgenomics.org/wes/workflows

{
  "workflow_params": {
    "reference_genome": {
      "class": "File",
      "path": "ark:/99999/fk4aZVT0ZWH8Ip0"
    },
    "bwa_index": {
      "class": "File",
      "path": "ark:/99999/fk4erydOcxk7PA2"
    },
    "dbsnp": {
      "class": "File",
      "path": "ark:/99999/fk4zKBK8XkAnaXQ"
    },
    "input_file": {
      "class": "File",
      "path": "ark:/99999/fk4U4TyRAKafWMB"
    }
  },
  "workflow_descriptor": "TOPMed Alignment Workflow",
  "workflow_url": "https://raw.githubusercontent.com/DataBiosphere/topmed-workflows/master/aligner/sbg-alignment-cwl/topmed-alignment.cwl",
  "workflow_type_version": "v1.0",
  "workflow_type": "CWL"
}
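Because only the input_file minid changes between samples, the payload can be generated rather than written by hand. The following is a minimal sketch; the helper name `build_payload` is illustrative and not part of any published tool, while the minids and workflow fields are copied from the payload above.

```python
import json

# Reference minids shared by all five submissions (from the payload above).
REFERENCE_MINIDS = {
    "reference_genome": "ark:/99999/fk4aZVT0ZWH8Ip0",
    "bwa_index": "ark:/99999/fk4erydOcxk7PA2",
    "dbsnp": "ark:/99999/fk4zKBK8XkAnaXQ",
}
WORKFLOW_URL = (
    "https://raw.githubusercontent.com/DataBiosphere/topmed-workflows/"
    "master/aligner/sbg-alignment-cwl/topmed-alignment.cwl"
)


def build_payload(input_minid):
    """Return the WES request body for one downsampled CRAM input."""
    params = {
        name: {"class": "File", "path": minid}
        for name, minid in REFERENCE_MINIDS.items()
    }
    params["input_file"] = {"class": "File", "path": input_minid}
    return {
        "workflow_params": params,
        "workflow_descriptor": "TOPMed Alignment Workflow",
        "workflow_url": WORKFLOW_URL,
        "workflow_type_version": "v1.0",
        "workflow_type": "CWL",
    }


# Print the payload for sample NWD176325 as formatted JSON.
print(json.dumps(build_payload("ark:/99999/fk4U4TyRAKafWMB"), indent=2))
```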

CURL Commands

Here are the 5 command-line curl statements for submitting the 5 Downsampled CRAM input files. Replace the <TOKEN> with the token generated in the previous section:

Downsample CRAM/CRAI ID Number: NWD176325, NA19238:

curl -H "Accept: application/json" -H "Content-Type: application/json" -X POST -H "Authorization: <TOKEN>" -d '{"workflow_params": {"reference_genome": {"class": "File", "path": "ark:/99999/fk4aZVT0ZWH8Ip0"}, "bwa_index": {"class": "File", "path": "ark:/99999/fk4erydOcxk7PA2"}, "dbsnp": {"class": "File", "path": "ark:/99999/fk4zKBK8XkAnaXQ"}, "input_file": {"class": "File", "path": "ark:/99999/fk4U4TyRAKafWMB"}}, "workflow_descriptor": "TOPMed Alignment Workflow", "workflow_url": "https://raw.githubusercontent.com/DataBiosphere/topmed-workflows/master/aligner/sbg-alignment-cwl/topmed-alignment.cwl", "workflow_type_version": "v1.0", "workflow_type": "CWL"}' https://nihcommons.globusgenomics.org/wes/workflows

Downsample CRAM/CRAI ID Number: NWD315403, HG01249:

curl -H "Accept: application/json" -H "Content-Type: application/json" -X POST -H "Authorization: <TOKEN>" -d '{"workflow_params": {"reference_genome": {"class": "File", "path": "ark:/99999/fk4aZVT0ZWH8Ip0"}, "bwa_index": {"class": "File", "path": "ark:/99999/fk4erydOcxk7PA2"}, "dbsnp": {"class": "File", "path": "ark:/99999/fk4zKBK8XkAnaXQ"}, "input_file": {"class": "File", "path": "ark:/99999/fk41FSiqz9iY58R1"}}, "workflow_descriptor": "TOPMed Alignment Workflow", "workflow_url": "https://raw.githubusercontent.com/DataBiosphere/topmed-workflows/master/aligner/sbg-alignment-cwl/topmed-alignment.cwl", "workflow_type_version": "v1.0", "workflow_type": "CWL"}' https://nihcommons.globusgenomics.org/wes/workflows

Downsample CRAM/CRAI ID Number: NWD136397, HG01110:

curl -H "Accept: application/json" -H "Content-Type: application/json" -X POST -H "Authorization: <TOKEN>" -d '{"workflow_params": {"reference_genome": {"class": "File", "path": "ark:/99999/fk4aZVT0ZWH8Ip0"}, "bwa_index": {"class": "File", "path": "ark:/99999/fk4erydOcxk7PA2"}, "dbsnp": {"class": "File", "path": "ark:/99999/fk4zKBK8XkAnaXQ"}, "input_file": {"class": "File", "path": "ark:/99999/fk456x1jMoFxfKB"}}, "workflow_descriptor": "TOPMed Alignment Workflow", "workflow_url": "https://raw.githubusercontent.com/DataBiosphere/topmed-workflows/master/aligner/sbg-alignment-cwl/topmed-alignment.cwl", "workflow_type_version": "v1.0", "workflow_type": "CWL"}' https://nihcommons.globusgenomics.org/wes/workflows

Downsample CRAM/CRAI ID Number: NWD231092, HG01111:

curl -H "Accept: application/json" -H "Content-Type: application/json" -X POST -H "Authorization: <TOKEN>" -d '{"workflow_params": {"reference_genome": {"class": "File", "path": "ark:/99999/fk4aZVT0ZWH8Ip0"}, "bwa_index": {"class": "File", "path": "ark:/99999/fk4erydOcxk7PA2"}, "dbsnp": {"class": "File", "path": "ark:/99999/fk4zKBK8XkAnaXQ"}, "input_file": {"class": "File", "path": "ark:/99999/fk4jVBacAVBkFsL"}}, "workflow_descriptor": "TOPMed Alignment Workflow", "workflow_url": "https://raw.githubusercontent.com/DataBiosphere/topmed-workflows/master/aligner/sbg-alignment-cwl/topmed-alignment.cwl", "workflow_type_version": "v1.0", "workflow_type": "CWL"}' https://nihcommons.globusgenomics.org/wes/workflows

Downsample CRAM/CRAI ID Number: NWD119836, NA12878:

curl -H "Accept: application/json" -H "Content-Type: application/json" -X POST -H "Authorization: <TOKEN>" -d '{"workflow_params": {"reference_genome": {"class": "File", "path": "ark:/99999/fk4aZVT0ZWH8Ip0"}, "bwa_index": {"class": "File", "path": "ark:/99999/fk4erydOcxk7PA2"}, "dbsnp": {"class": "File", "path": "ark:/99999/fk4zKBK8XkAnaXQ"}, "input_file": {"class": "File", "path": "ark:/99999/fk4cAzlMXIUOfes"}}, "workflow_descriptor": "TOPMed Alignment Workflow", "workflow_url": "https://raw.githubusercontent.com/DataBiosphere/topmed-workflows/master/aligner/sbg-alignment-cwl/topmed-alignment.cwl", "workflow_type_version": "v1.0", "workflow_type": "CWL"}' https://nihcommons.globusgenomics.org/wes/workflows

These curl commands return a tracking ID (workflow-id) that can be used to check the status, as shown in the next section.
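For scripted submissions, the same POST request can be assembled with the Python standard library. This is a sketch under the same assumptions as the curl commands: the Authorization header carries the token exactly as shown there (the source does not say whether a "Bearer" prefix is expected), and the function names are illustrative. The response is returned as parsed JSON without assuming the name of the field that holds the workflow ID.

```python
import json
import urllib.request

WES_WORKFLOWS_URL = "https://nihcommons.globusgenomics.org/wes/workflows"


def build_submit_request(payload, token):
    """Build (but do not send) the POST request for one workflow run."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        WES_WORKFLOWS_URL,
        data=body,
        method="POST",
        headers={
            "Accept": "application/json",
            "Content-Type": "application/json",
            # Header value mirrors the curl commands: the raw token.
            "Authorization": token,
        },
    )


def submit(payload, token):
    """Send the request and return the decoded JSON response."""
    request = build_submit_request(payload, token)
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Keeping request construction separate from sending makes it possible to inspect the request before any network call is made.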

Check Status

The WES resource for detailed status also provides the MINIDs for the output BDBag once the analysis is complete. You can query it with the following command:

curl -H "Accept: application/json" -H "Content-Type: application/json" -X GET -H "Authorization: <TOKEN>" https://nihcommons.globusgenomics.org/wes/workflows/<workflow-id>
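Since an analysis takes a while, a script will usually poll this endpoint until the run finishes. The sketch below assumes the response carries a `state` field with the state names defined in the GA4GH WES schema; that this deployment reports exactly these values is an assumption. `fetch_json` is a caller-supplied function (hypothetical here) that performs the authenticated GET and returns the decoded JSON.

```python
import time

WES_WORKFLOWS_URL = "https://nihcommons.globusgenomics.org/wes/workflows"
# Terminal states from the GA4GH WES schema; assumed to apply to this service.
TERMINAL_STATES = {"COMPLETE", "EXECUTOR_ERROR", "SYSTEM_ERROR", "CANCELED"}


def status_url(workflow_id):
    """Return the detailed-status URL for a workflow run."""
    return f"{WES_WORKFLOWS_URL}/{workflow_id}"


def wait_for_workflow(workflow_id, fetch_json, interval=60, max_polls=60,
                      sleep=time.sleep):
    """Poll until the run reaches a terminal state and return the last response.

    fetch_json(url) must return the decoded JSON body of an authenticated GET.
    """
    for _ in range(max_polls):
        status = fetch_json(status_url(workflow_id))
        if status.get("state") in TERMINAL_STATES:
            return status
        sleep(interval)
    raise TimeoutError(f"workflow {workflow_id} did not finish in time")
```

With the defaults, the loop checks once a minute for up to an hour, which comfortably covers the runtimes reported for the downsampled inputs.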

Results

The following table lists the results of the analysis: the md5sum of the output, the runtime, and the cost. Note that the md5sums are calculated after removing the headers from the output CRAM file, so that they can be compared with the results from other fullstacks.

Downsample Input   md5sum                             Runtime   Cost ($)
NWD119836          105bf65c2e4ea23f7a110bee17c1a074   19 min    0.036
NWD136397          c8bab3ba0f90406a035cabb243716356   19 min    0.036
NWD176325          186d2cdf1efdc2746e6d3b26cd887c0a   19 min    0.036
NWD231092          4ac1e5edc1fd9d0644d2c0082ac02392   19 min    0.036
NWD315403          efda0cdef1e172f495052a62a93d799c   19 min    0.036
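The source does not describe how the headers were removed before hashing. One possible approach, shown here purely as an illustration and not necessarily the procedure used for the table above, is to decode the CRAM to SAM text with a tool such as samtools and hash only the alignment records, skipping the header lines that start with "@".

```python
import hashlib


def md5_without_header(sam_lines):
    """Return the md5 hex digest of SAM records, ignoring header lines.

    sam_lines is an iterable of bytes lines, e.g. a file opened in binary
    mode that contains the decoded SAM text of the output CRAM.
    """
    digest = hashlib.md5()
    for line in sam_lines:
        if line.startswith(b"@"):
            continue  # SAM header lines start with '@'
        digest.update(line)
    return digest.hexdigest()
```

Hashing only the records means two outputs with identical alignments compare equal even if their headers record different command lines or program versions.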


This work was supported in part by NIH grant 1U54EB020406-01.