github.com/pachyderm/pachyderm@v1.13.4/examples/gatk/README.md (about) 1 > INFO - Pachyderm 2.0 introduces profound architectural changes to the product. As a result, our examples pre and post 2.0 are kept in two separate branches: 2 > - Branch Master: Examples using Pachyderm 2.0 and later versions - https://github.com/pachyderm/pachyderm/tree/master/examples 3 > - Branch 1.13.x: Examples using Pachyderm 1.13 and older versions - https://github.com/pachyderm/pachyderm/tree/1.13.x/examples 4 # Variant Discovery with GATK 5 6  7 8 This example illustrates the use of GATK in Pachyderm for [Germline](https://en.wikipedia.org/wiki/Germline) variant calling and joint [genotyping](https://en.wikipedia.org/wiki/Genotyping). 9 10 GATK is a toolkit provided by the Broad Institute that was created to assist scientists with variant discovery and genotyping. Each stage of this GATK best practice pipeline can be scaled individually and is automatically triggered as data flows into the top of the pipeline. The example follows [this tutorial](https://drive.google.com/open?id=0BzI1CyccGsZiQ1BONUxfaGhZRGc) from GATK, which includes more details about the various stages. 11 12 ## Committing the reference genome 13 14 You can retrieve all the input data for this pipeline from the Broad Institute [here](https://drive.google.com/open?id=0BzI1CyccGsZicE5HNkR6anpLTnM). We will utilize a b37 human genome reference containing only a subset of chromosome 20, which we prepared specially for GATK tutorials in order to provide a reasonable size for download. It is accompanied by its index and sequence dictionary. 15 16 Download `GATK_Germline.zip` and unzip it to your local dir. 17 18 ```shell 19 wget https://s3-us-west-1.amazonaws.com/pachyderm.io/Examples_Data_Repo/GATK_Germline.zip 20 ``` 21 ```shell 22 $ unzip GATK_Germline.zip 23 Archive: GATK_Germline.zip 24 creating: data/ 25 inflating: data/.DS_Store 26 creating: __MACOSX/ 27 creating: __MACOSX/data/ 28 inflating: __MACOSX/data/._.DS_Store 29 inflating: data/.Rhistory 30 creating: data/bams/ 31 inflating: data/bams/.DS_Store 32 etc... 33 ``` 34 Change into the directory that contains the files we want to import 35 ```shell 36 $ cd data/ref 37 $ ls 38 Icon ref.dict ref.fasta ref.fasta.fai refSDF 39 ``` 40 Next, we want to create our pachyderm repo and then instruct pachyderm to import those into our repo 41 ```shell 42 $ pachctl create repo reference 43 $ pachctl put file reference@master -r -f . 44 ``` 45 First milestone reached! Lets just check and make sure everything looks good 46 ```shell 47 $ pachctl list repo 48 NAME CREATED SIZE 49 reference 43 seconds ago 83.68MiB 50 ``` 51 ```shell 52 $ pachctl list file reference@master 53 NAME TYPE SIZE 54 Icon file 0B 55 ref.dict file 164B 56 ref.fasta file 61.11MiB 57 ref.fasta.fai file 20B 58 refSDF dir 22.57MiB 59 $ cd ../../ 60 ``` 61 62 ## Committing a sample 63 64 Next, we're going to work on commiting the `*.bam` files for Mom. Let's start by create a repositories for input `*.bam` files to go into: 65 66 ```shell 67 $ pachctl create repo samples 68 ``` 69 70 Add a `*.bam` file (along with it's index file) corresponding to a first sample, in our case it's the `mother`. Here we will assume that the files corresponding to each sample are committed to separate directories (e.g., `/mother`). 71 72 ```shell 73 $ cd data/bams/ 74 $ pachctl start commit samples@master 75 $ for f in $(ls mother.*); do pachctl put file samples@master:mother/$f -f $f; done 76 $ pachctl finish commit samples@master 77 $ cd ../../ 78 ``` 79 80 You should then be able to see the versioned sample data in Pachyderm: 81 82 ```shell 83 $ pachctl list file samples@master 84 NAME TYPE SIZE 85 mother dir 23.79MiB 86 $ pachctl list file samples@master:mother 87 NAME TYPE SIZE 88 mother/mother.bai file 9.047KiB 89 mother/mother.bam file 23.79MiB 90 ``` 91 92 ## Variant calling 93 94 To call variants for the input sample, we will run the `HaplotypeCaller` using GATK. Details of the exact GATK command and related scripting are included in the [likelihoods.json](likelihoods.json) pipeline specification. 95 96 To create and run the variant calling pipeline to generate genotype likelihoods: 97 98 ```shell 99 $ pachctl create pipeline -f likelihoods.json 100 ``` 101 102 This will automatically trigger a job: 103 104 ```shell 105 $ pachctl list job 106 ID OUTPUT COMMIT STARTED DURATION RESTART PROGRESS DL UL STATE 107 c61c71d1-6544-48ad-8361-b4ad155ba1a0 likelihoods/992393004c5a45c0a35995cf0179f1cb 43 minutes ago 18 seconds 0 1 + 0 / 1 107.5MiB 4.667MiB success 108 ``` 109 110 And you can view the output of the variant calling as follows: 111 112 ```shell 113 $ pachctl list file likelihoods@master 114 NAME TYPE SIZE 115 mother.g.vcf file 4.667MiB 116 mother.g.vcf.idx file 758B 117 ``` 118 119 ## Joint genotyping 120 121 The last step is to joint call all your GVCF files using the GATK tool GenotypeGVCFs. Details of this GATK command and related scripting are included in the [joint-call.json](joint-call.json) pipeline specification. 122 123 To run the joint genotyping: 124 125 ```shell 126 $ pachctl create pipeline -f joint_call.json 127 ``` 128 129 This will automatically trigger a job and produce our final output: 130 131 ```shell 132 $ pachctl list job 133 ID OUTPUT COMMIT STARTED DURATION RESTART PROGRESS DL UL STATE 134 67135f10-4121-4f29-a30b-1eaf6ffe2194 joint_call/c4ebd6dd0c764a97a8d7f3a71f6bb9ce 38 minutes ago 5 seconds 0 1 + 0 / 1 88.35MiB 113.9KiB success 135 c61c71d1-6544-48ad-8361-b4ad155ba1a0 likelihoods/992393004c5a45c0a35995cf0179f1cb 43 minutes ago 18 seconds 0 1 + 0 / 1 107.5MiB 4.667MiB success 136 $ pachctl list file joint_call@master 137 NAME TYPE SIZE 138 joint.vcf file 103.7KiB 139 joint.vcf.idx file 10.21KiB 140 ``` 141 142 ## Adding more samples 143 144 Now that we have our pipelines running, out final results will be automatically updated any time we add new samples. To illustrate this, we can add the `father` and `son` samples as follows: 145 146 ```shell 147 $ cd data/bams/ 148 $ pachctl start commit samples@master 149 dc963cc9bdc2486798b92d20eead5058 150 $ for f in $(ls father.*); do pachctl put file samples@master:father/$f -f $f; done 151 $ pachctl finish commit samples@master 152 $ pachctl start commit samples@master 153 84e6615de64f43d8815909fa978bd4bc 154 $ for f in $(ls son.*); do pachctl put file samples@master:son/$f -f $f; done 155 $ pachctl finish commit samples@master 156 $ pachctl list file samples@master 157 NAME TYPE SIZE 158 father dir 9.662MiB 159 mother dir 23.79MiB 160 son dir 9.58MiB 161 ``` 162 163 This will trigger new jobs to process the new samples: 164 165 ```shell 166 pachctl list job 167 ID OUTPUT COMMIT STARTED DURATION RESTART PROGRESS DL UL STATE 168 73222c06-c444-4dff-b370-6c7dea83258d joint_call/32d3615a036e4c0eadd3ed49435ee7db About a minute ago 6 seconds 0 1 + 0 / 1 97.64MiB 188.6KiB success 169 47652b28-a1ba-454e-a74c-7b43740a72f0 likelihoods/a7e160f8418a4bd5a46233bd6b1d84ce 2 minutes ago 16 seconds 0 1 + 2 / 3 93.26MiB 4.729MiB success 170 55b17b8c-b305-4d4c-b822-f861578dadd8 joint_call/4eb7c19600134af5b9595bf5d9f71edc 2 minutes ago 5 seconds 0 1 + 0 / 1 92.92MiB 166.3KiB success 171 a350a349-5ddb-4e19-bdda-66d7edbf9447 likelihoods/e783989ca367428ea2df6406a23bea6c 2 minutes ago 19 seconds 0 1 + 1 / 2 93.35MiB 4.564MiB success 172 67135f10-4121-4f29-a30b-1eaf6ffe2194 joint_call/c4ebd6dd0c764a97a8d7f3a71f6bb9ce About an hour ago 5 seconds 0 1 + 0 / 1 88.35MiB 113.9KiB success 173 c61c71d1-6544-48ad-8361-b4ad155ba1a0 likelihoods/992393004c5a45c0a35995cf0179f1cb About an hour ago 18 seconds 0 1 + 0 / 1 107.5MiB 4.667MiB success 174 ``` 175 176 If you are using the [Enterprise Edition](https://docs.pachyderm.com/1.13.x/enterprise/), you should be able to see the DAG and data as shown below: 177 178 