github.com/pachyderm/pachyderm@v1.13.4/examples/gatk/README.md

github.com/pachyderm/pachyderm@v1.13.4/examples/gatk/README.md (about)

     1  >![pach_logo](../img/pach_logo.svg) INFO - Pachyderm 2.0 introduces profound architectural changes to the product. As a result, our examples pre and post 2.0 are kept in two separate branches:
     2  > - Branch Master: Examples using Pachyderm 2.0 and later versions - https://github.com/pachyderm/pachyderm/tree/master/examples
     3  > - Branch 1.13.x: Examples using Pachyderm 1.13 and older versions - https://github.com/pachyderm/pachyderm/tree/1.13.x/examples
     4  # Variant Discovery with GATK
     5  
     6  ![alt tag](pipeline.png)
     7  
     8  This example illustrates the use of GATK in Pachyderm for [Germline](https://en.wikipedia.org/wiki/Germline) variant calling and joint [genotyping](https://en.wikipedia.org/wiki/Genotyping).
     9  
    10  GATK is a toolkit provided by the Broad Institute that was created to assist scientists with variant discovery and genotyping. Each stage of this GATK best practice pipeline can be scaled individually and is automatically triggered as data flows into the top of the pipeline. The example follows [this tutorial](https://drive.google.com/open?id=0BzI1CyccGsZiQ1BONUxfaGhZRGc) from GATK, which includes more details about the various stages.
    11  
    12  ## Committing the reference genome
    13  
    14  You can retrieve all the input data for this pipeline from the Broad Institute [here](https://drive.google.com/open?id=0BzI1CyccGsZicE5HNkR6anpLTnM). We will utilize a b37 human genome reference containing only a subset of chromosome 20, which we prepared specially for GATK tutorials in order to provide a reasonable size for download. It is accompanied by its index and sequence dictionary.
    15  
    16  Download `GATK_Germline.zip` and unzip it to your local dir.
    17  
    18  ```shell
    19  wget https://s3-us-west-1.amazonaws.com/pachyderm.io/Examples_Data_Repo/GATK_Germline.zip
    20  ``` 
    21  ```shell
    22  $ unzip GATK_Germline.zip
    23  Archive:  GATK_Germline.zip
    24     creating: data/
    25    inflating: data/.DS_Store
    26     creating: __MACOSX/
    27     creating: __MACOSX/data/
    28    inflating: __MACOSX/data/._.DS_Store
    29    inflating: data/.Rhistory
    30     creating: data/bams/
    31    inflating: data/bams/.DS_Store
    32    etc...
    33  ```
    34  Change into the directory that contains the files we want to import
    35  ```shell
    36  $ cd data/ref
    37  $ ls
    38  Icon  ref.dict  ref.fasta  ref.fasta.fai  refSDF
    39  ```
    40  Next, we want to create our pachyderm repo and then instruct pachyderm to import those into our repo
    41  ```shell
    42  $ pachctl create repo reference
    43  $ pachctl put file reference@master -r -f .
    44  ```
    45  First milestone reached! Lets just check and make sure everything looks good
    46  ```shell
    47  $ pachctl list repo
    48  NAME                CREATED             SIZE
    49  reference           43 seconds ago      83.68MiB
    50  ```
    51  ```shell
    52  $ pachctl list file reference@master
    53  NAME                TYPE                SIZE
    54  Icon                file                0B
    55  ref.dict            file                164B
    56  ref.fasta           file                61.11MiB
    57  ref.fasta.fai       file                20B
    58  refSDF              dir                 22.57MiB
    59  $ cd ../../
    60  ```
    61  
    62  ## Committing a sample 
    63  
    64  Next, we're going to work on commiting the `*.bam` files for Mom. Let's start by create a repositories for input `*.bam` files to go into:
    65  
    66  ```shell
    67  $ pachctl create repo samples
    68  ```
    69  
    70  Add a `*.bam` file (along with it's index file) corresponding to a first sample, in our case it's the `mother`. Here we will assume that the files corresponding to each sample are committed to separate directories (e.g., `/mother`). 
    71  
    72  ```shell
    73  $ cd data/bams/
    74  $ pachctl start commit samples@master
    75  $ for f in $(ls mother.*); do pachctl put file samples@master:mother/$f -f $f; done
    76  $ pachctl finish commit samples@master
    77  $ cd ../../
    78  ```
    79  
    80  You should then be able to see the versioned sample data in Pachyderm:
    81  
    82  ```shell
    83  $ pachctl list file samples@master
    84  NAME                TYPE                SIZE
    85  mother              dir                 23.79MiB
    86  $ pachctl list file samples@master:mother
    87  NAME                TYPE                SIZE
    88  mother/mother.bai   file                9.047KiB
    89  mother/mother.bam   file                23.79MiB
    90  ```
    91  
    92  ## Variant calling
    93  
    94  To call variants for the input sample, we will run the `HaplotypeCaller` using GATK. Details of the exact GATK command and related scripting are included in the [likelihoods.json](likelihoods.json) pipeline specification. 
    95  
    96  To create and run the variant calling pipeline to generate genotype likelihoods:
    97  
    98  ```shell
    99  $ pachctl create pipeline -f likelihoods.json
   100  ```
   101  
   102  This will automatically trigger a job:
   103  
   104  ```shell
   105  $ pachctl list job
   106  ID                                   OUTPUT COMMIT                                STARTED        DURATION   RESTART PROGRESS  DL       UL       STATE
   107  c61c71d1-6544-48ad-8361-b4ad155ba1a0 likelihoods/992393004c5a45c0a35995cf0179f1cb 43 minutes ago 18 seconds 0       1 + 0 / 1 107.5MiB 4.667MiB success
   108  ```
   109  
   110  And you can view the output of the variant calling as follows:
   111  
   112  ```shell
   113  $ pachctl list file likelihoods@master
   114  NAME                TYPE                SIZE
   115  mother.g.vcf        file                4.667MiB
   116  mother.g.vcf.idx    file                758B
   117  ```
   118  
   119  ## Joint genotyping
   120  
   121  The last step is to joint call all your GVCF files using the GATK tool GenotypeGVCFs.  Details of this GATK command and related scripting are included in the [joint-call.json](joint-call.json) pipeline specification.
   122  
   123  To run the joint genotyping:
   124  
   125  ```shell
   126  $ pachctl create pipeline -f joint_call.json
   127  ```
   128  
   129  This will automatically trigger a job and produce our final output:
   130  
   131  ```shell
   132  $ pachctl list job
   133  ID                                   OUTPUT COMMIT                                STARTED        DURATION   RESTART PROGRESS  DL       UL       STATE
   134  67135f10-4121-4f29-a30b-1eaf6ffe2194 joint_call/c4ebd6dd0c764a97a8d7f3a71f6bb9ce  38 minutes ago 5 seconds  0       1 + 0 / 1 88.35MiB 113.9KiB success
   135  c61c71d1-6544-48ad-8361-b4ad155ba1a0 likelihoods/992393004c5a45c0a35995cf0179f1cb 43 minutes ago 18 seconds 0       1 + 0 / 1 107.5MiB 4.667MiB success
   136  $ pachctl list file joint_call@master
   137  NAME                TYPE                SIZE
   138  joint.vcf           file                103.7KiB
   139  joint.vcf.idx       file                10.21KiB
   140  ```
   141  
   142  ## Adding more samples
   143  
   144  Now that we have our pipelines running, out final results will be automatically updated any time we add new samples. To illustrate this, we can add the `father` and `son` samples as follows:
   145  
   146  ```shell
   147  $ cd data/bams/
   148  $ pachctl start commit samples@master
   149  dc963cc9bdc2486798b92d20eead5058
   150  $ for f in $(ls father.*); do pachctl put file samples@master:father/$f -f $f; done
   151  $ pachctl finish commit samples@master
   152  $ pachctl start commit samples@master
   153  84e6615de64f43d8815909fa978bd4bc
   154  $ for f in $(ls son.*); do pachctl put file samples@master:son/$f -f $f; done
   155  $ pachctl finish commit samples@master
   156  $ pachctl list file samples@master
   157  NAME                TYPE                SIZE
   158  father              dir                 9.662MiB
   159  mother              dir                 23.79MiB
   160  son                 dir                 9.58MiB 
   161  ```
   162  
   163  This will trigger new jobs to process the new samples:
   164  
   165  ```shell
   166  pachctl list job
   167  ID                                   OUTPUT COMMIT                                STARTED            DURATION   RESTART PROGRESS  DL       UL       STATE
   168  73222c06-c444-4dff-b370-6c7dea83258d joint_call/32d3615a036e4c0eadd3ed49435ee7db  About a minute ago 6 seconds  0       1 + 0 / 1 97.64MiB 188.6KiB success
   169  47652b28-a1ba-454e-a74c-7b43740a72f0 likelihoods/a7e160f8418a4bd5a46233bd6b1d84ce 2 minutes ago      16 seconds 0       1 + 2 / 3 93.26MiB 4.729MiB success
   170  55b17b8c-b305-4d4c-b822-f861578dadd8 joint_call/4eb7c19600134af5b9595bf5d9f71edc  2 minutes ago      5 seconds  0       1 + 0 / 1 92.92MiB 166.3KiB success
   171  a350a349-5ddb-4e19-bdda-66d7edbf9447 likelihoods/e783989ca367428ea2df6406a23bea6c 2 minutes ago      19 seconds 0       1 + 1 / 2 93.35MiB 4.564MiB success
   172  67135f10-4121-4f29-a30b-1eaf6ffe2194 joint_call/c4ebd6dd0c764a97a8d7f3a71f6bb9ce  About an hour ago  5 seconds  0       1 + 0 / 1 88.35MiB 113.9KiB success
   173  c61c71d1-6544-48ad-8361-b4ad155ba1a0 likelihoods/992393004c5a45c0a35995cf0179f1cb About an hour ago  18 seconds 0       1 + 0 / 1 107.5MiB 4.667MiB success
   174  ```
   175  
   176  If you are using the [Enterprise Edition](https://docs.pachyderm.com/1.13.x/enterprise/), you should be able to see the DAG and data as shown below:
   177  
   178  ![alt tag](dash.png)