Getting Started
===============

Shellflow is designed for rapid development of research workflows. If you
can write a bash script, you do not have to learn much new syntax: you
only have to add brackets to mark which files are inputs and which are
outputs.

Before starting the tutorial
----------------------------

This tutorial requires the following software:

-  `bwa <http://bio-bwa.sourceforge.net/>`__
-  `gatk4 <https://software.broadinstitute.org/gatk/download/>`__
-  `shellflow <https://github.com/informationsea/shellflow>`__

The following data are also required:

-  Reference genome (for example:
   `hs37d5 <ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence>`__)
-  BWA index of the reference genome (``bwa index hs37d5.fa``)
-  Sequence dictionary file of the reference genome
   (``gatk --java-options "-Xmx4G" CreateSequenceDictionary -R hs37d5.fa``)
-  Some sequence data (for example:
   `DRR002191 <https://trace.ddbj.nig.ac.jp/DRASearch/run?acc=DRR002191>`__)

1st step: mapping
-----------------

The syntax of a shellflow script is very similar to that of a bash shell
script. All you have to do is enclose input files in double parentheses
(``((`` and ``))``) and output files in double brackets (``[[`` and
``]]``). You can use pipes and redirection in a workflow script just as
in an ordinary shell script.

Content of ``gettingstarted.sf``:

.. code:: bash

    bwa mem -t 6 hs37d5.fa <(bzip2 -dc ((DRR002191_1.fastq.bz2))) <(bzip2 -dc ((DRR002191_2.fastq.bz2))) > [[DRR002191.sam]]

Run the workflow:

.. code:: bash

    $ shellflow run gettingstarted.sf

2nd step: check status
----------------------

``shellflow viewlog`` lists the recorded workflow runs. The sample output
in this step was captured from the repository's ``step1.sf`` example, so
that name appears instead of ``gettingstarted.sf``.

.. code:: bash

    $ shellflow viewlog
    #| State|Success|Failed|Running|Pending|File Changed|Start Date         |Name
    1|  Done|      1|     0|      0|      0|         Yes|2018/10/14 15:00:48|step1.sf

Pass a run number to see the details of that run:

.. code:: bash

    $ shellflow viewlog 1
    Workflow Script Path: /home/okamura/Documents/Programming/GO/workspace/src/github.com/informationsea/shellflow/examples/getting-started/step1/step1.sf
    Workflow Log Path: shellflow-wf/20181014-145901.507-step1.sf-1103fc92-e078-4e47-a316-62c4f16cb935
    Job Start: 2018/10/14 15:00:48
    Changed Input Files:
    ---- Job: 1 ------------
    State: JobDone
    Exit code: 0
    Reusable: No
    Script: bwa mem -t 6 hs37d5.fa DRR002191_1.fastq.bz2 DRR002191_2.fastq.bz2 > DRR002191.sam
    Input: DRR002191_1.fastq.bz2 DRR002191_2.fastq.bz2
    Output: DRR002191.sam
    Dependent Job IDs:
    Log directory: shellflow-wf/20181014-145901.507-step1.sf-1103fc92-e078-4e47-a316-62c4f16cb935/job001

The job's log directory contains the following files:

.. code:: bash

    $ ls shellflow-wf/20181014-145901.507-step1.sf-1103fc92-e078-4e47-a316-62c4f16cb935/job001
    input.json local-run-pid.txt output.json rc run.sh run.stderr run.stdout script.sh script.stderr script.stdout

3rd step: add more commands
---------------------------

To add a command that depends on an earlier one, append it as a new line
at the end of the script. Shellflow works out automatically which
commands depend on which. Unlike a Makefile, shellflow assumes that every
command a line depends on appears before that line.

.. code:: bash

    bwa mem -R "@RG\tID:DRR002191\tSM:DRR002191\tPL:illumina\tLB:DRR002191" -t 6 hs37d5.fa <(bzip2 -dc ((DRR002191_1.fastq.bz2))) <(bzip2 -dc ((DRR002191_2.fastq.bz2))) > [[DRR002191.sam]]
    gatk SortSam -I ((DRR002191.sam)) -O [[DRR002191-sorted.bam]] --SORT_ORDER coordinate
    gatk MarkDuplicates -I ((DRR002191-sorted.bam)) -O [[DRR002191-markdup.bam]] -M [[DRR002191-markdup-metrics.txt]]
    gatk BaseRecalibrator --known-sites ((common_all_20180423.vcf.gz)) -I ((DRR002191-markdup.bam)) -O [[DRR002191-bqsr.txt]] -R hs37d5.fa

Shellflow runs only the added commands.

.. code:: bash

    $ shellflow run gettingstarted.sf

4th step: use variable
----------------------

A line that starts with ``#%`` is parsed as flowscript, the language
embedded in shellflow. In the example below, ``{{SAMPLE_ID}}`` inserts
the value of the variable into each command.

.. code:: bash

    #% SAMPLE_ID = "DRR002191"
    bwa mem -R "@RG\tID:{{SAMPLE_ID}}\tSM:{{SAMPLE_ID}}\tPL:illumina\tLB:{{SAMPLE_ID}}" -t 6 hs37d5.fa <(bzip2 -dc (({{SAMPLE_ID}}_1.fastq.bz2))) <(bzip2 -dc (({{SAMPLE_ID}}_2.fastq.bz2))) > [[{{SAMPLE_ID}}.sam]]
    gatk SortSam -I (({{SAMPLE_ID}}.sam)) -O [[{{SAMPLE_ID}}-sorted.bam]] --SORT_ORDER coordinate
    gatk MarkDuplicates -I (({{SAMPLE_ID}}-sorted.bam)) -O [[{{SAMPLE_ID}}-markdup.bam]] -M [[{{SAMPLE_ID}}-markdup-metrics.txt]]
    gatk BaseRecalibrator --known-sites ((common_all_20180423.vcf.gz)) -I (({{SAMPLE_ID}}-markdup.bam)) -O [[{{SAMPLE_ID}}-bqsr.txt]] -R hs37d5.fa

5th step: use loop
------------------

A ``for`` loop runs the same commands for several samples:

.. code:: bash

    for SAMPLE_ID in DRR002191 DRR002192; do
        bwa mem -R "@RG\tID:{{SAMPLE_ID}}\tSM:{{SAMPLE_ID}}\tPL:illumina\tLB:{{SAMPLE_ID}}" -t 6 hs37d5.fa <(bzip2 -dc (({{SAMPLE_ID}}_1.fastq.bz2))) <(bzip2 -dc (({{SAMPLE_ID}}_2.fastq.bz2))) > [[{{SAMPLE_ID}}.sam]]
        gatk SortSam -I (({{SAMPLE_ID}}.sam)) -O [[{{SAMPLE_ID}}-sorted.bam]] --SORT_ORDER coordinate
        gatk MarkDuplicates -I (({{SAMPLE_ID}}-sorted.bam)) -O [[{{SAMPLE_ID}}-markdup.bam]] -M [[{{SAMPLE_ID}}-markdup-metrics.txt]]
        gatk BaseRecalibrator --known-sites ((common_all_20180423.vcf.gz)) -I (({{SAMPLE_ID}}-markdup.bam)) -O [[{{SAMPLE_ID}}-bqsr.txt]] -R hs37d5.fa
    done

The list of samples can also come from a flowscript variable:

.. code:: bash

    #% SAMPLES = ["DRR002191", "DRR002192"]
    for SAMPLE_ID in {{SAMPLES}}; do
        bwa mem -R "@RG\tID:{{SAMPLE_ID}}\tSM:{{SAMPLE_ID}}\tPL:illumina\tLB:{{SAMPLE_ID}}" -t 6 hs37d5.fa <(bzip2 -dc (({{SAMPLE_ID}}_1.fastq.bz2))) <(bzip2 -dc (({{SAMPLE_ID}}_2.fastq.bz2))) > [[{{SAMPLE_ID}}.sam]]
        gatk SortSam -I (({{SAMPLE_ID}}.sam)) -O [[{{SAMPLE_ID}}-sorted.bam]] --SORT_ORDER coordinate
        gatk MarkDuplicates -I (({{SAMPLE_ID}}-sorted.bam)) -O [[{{SAMPLE_ID}}-markdup.bam]] -M [[{{SAMPLE_ID}}-markdup-metrics.txt]]
        gatk BaseRecalibrator --known-sites ((common_all_20180423.vcf.gz)) -I (({{SAMPLE_ID}}-markdup.bam)) -O [[{{SAMPLE_ID}}-bqsr.txt]] -R hs37d5.fa
    done

6th step: map all FASTQ in a directory
--------------------------------------

To process every FASTQ pair in the directory, loop over the first-read
files and derive the sample ID from each file name:

.. code:: bash

    for FILENAME in *_1.fastq.bz2; do
    #% SAMPLE_ID = basename(FILENAME, "_1.fastq.bz2")
        bwa mem -R "@RG\tID:{{SAMPLE_ID}}\tSM:{{SAMPLE_ID}}\tPL:illumina\tLB:{{SAMPLE_ID}}" -t 6 hs37d5.fa <(bzip2 -dc (({{SAMPLE_ID}}_1.fastq.bz2))) <(bzip2 -dc (({{SAMPLE_ID}}_2.fastq.bz2))) > [[{{SAMPLE_ID}}.sam]]
        gatk SortSam -I (({{SAMPLE_ID}}.sam)) -O [[{{SAMPLE_ID}}-sorted.bam]] --SORT_ORDER coordinate
        gatk MarkDuplicates -I (({{SAMPLE_ID}}-sorted.bam)) -O [[{{SAMPLE_ID}}-markdup.bam]] -M [[{{SAMPLE_ID}}-markdup-metrics.txt]]
        gatk BaseRecalibrator --known-sites ((common_all_20180423.vcf.gz)) -I (({{SAMPLE_ID}}-markdup.bam)) -O [[{{SAMPLE_ID}}-bqsr.txt]] -R hs37d5.fa
    done
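
The sample IDs that the loop in the 6th step would produce can be previewed in plain bash before running the workflow. The sketch below is ordinary bash, not shellflow syntax. It assumes that flowscript's ``basename(FILENAME, "_1.fastq.bz2")`` behaves like the standard ``basename`` utility called with a suffix argument; this tutorial does not confirm that. The function names ``sample_id_of`` and ``list_sample_ids`` are hypothetical helpers introduced only for illustration.

```shell
#!/usr/bin/env bash
# Plain-bash preview of sample IDs (NOT shellflow syntax).
# Assumption: flowscript's basename(FILENAME, suffix) matches the
# behaviour of the standard `basename FILE SUFFIX` utility.

# Hypothetical helper: strip the directory part and the "_1.fastq.bz2"
# suffix from a first-read FASTQ file name.
sample_id_of() {
    basename "$1" _1.fastq.bz2
}

# Hypothetical helper: print one sample ID per first-read file found
# in the given directory.
list_sample_ids() {
    local dir=$1 f
    for f in "$dir"/*_1.fastq.bz2; do
        # If nothing matches, the glob stays literal; skip it.
        [ -e "$f" ] || continue
        sample_id_of "$f"
    done
}
```

Calling ``list_sample_ids .`` in the data directory prints one ID per line. Under the assumption above, these are the values that ``SAMPLE_ID`` would take in the loop.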