Getting Started
===============

Shellflow is designed for rapid development of research workflows. If you
can write a bash script, you do not have to learn much new syntax: you
only add brackets to mark which files are inputs and which are outputs.

Before starting tutorial
------------------------

This tutorial requires the following software.

-  `bwa <http://bio-bwa.sourceforge.net/>`__
-  `gatk4 <https://software.broadinstitute.org/gatk/download/>`__
-  `shellflow <https://github.com/informationsea/shellflow>`__

The following data are also required.

-  Reference genome (for example:
   `hs37d5 <ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence>`__)
-  BWA index of the reference genome (``bwa index hs37d5.fa``)
-  Sequence dictionary file of the reference genome
   (``gatk --java-options "-Xmx4G" CreateSequenceDictionary -R hs37d5.fa``)
-  Some sequence data (for example:
   `DRR002191 <https://trace.ddbj.nig.ac.jp/DRASearch/run?acc=DRR002191>`__)

1st step: mapping
-----------------

The syntax of a shellflow script is very similar to that of a bash shell
script. All you have to do is enclose input files in double parentheses
(``((`` and ``))``) and output files in double brackets (``[[`` and
``]]``). You can use pipes and redirection in a workflow script just as in
an ordinary shell script.

Content of ``gettingstarted.sf``:

.. code:: bash

    bwa mem -t 6 hs37d5.fa <(bzip2 -dc ((DRR002191_1.fastq.bz2))) <(bzip2 -dc ((DRR002191_2.fastq.bz2))) > [[DRR002191.sam]]

Run the workflow:

.. code:: bash

    $ shellflow run gettingstarted.sf

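As a rough illustration of the annotation syntax, the bash sketch below
pulls the annotated file names out of the command above with ``grep`` and
``sed``, and prints the plain command that remains once the markers are
removed. It is not shellflow's own parser; the variable names ``inputs``,
``outputs`` and ``plain`` are made up for this example.

.. code:: bash

    #!/usr/bin/env bash
    # Illustration only: this is NOT how shellflow parses a workflow script.
    line='bwa mem -t 6 hs37d5.fa <(bzip2 -dc ((DRR002191_1.fastq.bz2))) <(bzip2 -dc ((DRR002191_2.fastq.bz2))) > [[DRR002191.sam]]'

    # Input files: the text between "((" and "))".
    inputs=$(printf '%s\n' "$line" | grep -oE '\(\([^()]+\)\)' \
        | sed -e 's/^((//' -e 's/))$//' | tr '\n' ' ')
    inputs=${inputs% }   # drop the trailing space left by tr

    # Output files: the text between "[[" and "]]".
    outputs=$(printf '%s\n' "$line" | grep -oE '\[\[[^][]+\]\]' \
        | sed -e 's/^\[\[//' -e 's/]]$//' | tr '\n' ' ')
    outputs=${outputs% }

    # The same line with the markers removed, i.e. an ordinary bash command.
    plain=$(printf '%s\n' "$line" \
        | sed -e 's/((\([^()]*\)))/\1/g' -e 's/\[\[\([^][]*\)]]/\1/g')

    echo "inputs:  $inputs"
    echo "outputs: $outputs"
    echo "command: $plain"
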
2nd step: check status
----------------------

``shellflow viewlog`` lists the workflow runs and their state.

.. code:: bash

    $ shellflow viewlog
      #|  State|Success|Failed|Running|Pending|File Changed|Start Date         |Name
      1|   Done|      1|     0|      0|      0|         Yes|2018/10/14 15:00:48|step1.sf

Adding a run number shows the details of that run and its jobs.

.. code:: bash

    $ shellflow viewlog 1
    Workflow Script Path: /home/okamura/Documents/Programming/GO/workspace/src/github.com/informationsea/shellflow/examples/getting-started/step1/step1.sf
       Workflow Log Path: shellflow-wf/20181014-145901.507-step1.sf-1103fc92-e078-4e47-a316-62c4f16cb935
               Job Start: 2018/10/14 15:00:48
     Changed Input Files:
    ---- Job: 1 ------------
                 State: JobDone
             Exit code: 0
              Reusable: No
                Script: bwa mem -t 6 hs37d5.fa DRR002191_1.fastq.bz2 DRR002191_2.fastq.bz2 > DRR002191.sam
                 Input: DRR002191_1.fastq.bz2 DRR002191_2.fastq.bz2
                Output: DRR002191.sam
     Dependent Job IDs:
         Log directory: shellflow-wf/20181014-145901.507-step1.sf-1103fc92-e078-4e47-a316-62c4f16cb935/job001

Each job keeps its files in the directory shown as ``Log directory``.

.. code:: bash

    $ ls shellflow-wf/20181014-145901.507-step1.sf-1103fc92-e078-4e47-a316-62c4f16cb935/job001
    input.json  local-run-pid.txt  output.json  rc  run.sh  run.stderr  run.stdout  script.sh  script.stderr  script.stdout

3rd step: add more commands
---------------------------

To add a command that depends on a previous command, append a new line at
the end of the script. Shellflow automatically determines which commands
depend on which other commands. Unlike a Makefile, shellflow assumes that
every command a line depends on appears before that line.

.. code:: bash

    bwa mem -R "@RG\tID:DRR002191\tSM:DRR002191\tPL:illumina\tLB:DRR002191" -t 6 hs37d5.fa <(bzip2 -dc ((DRR002191_1.fastq.bz2))) <(bzip2 -dc ((DRR002191_2.fastq.bz2))) > [[DRR002191.sam]]
    gatk SortSam -I ((DRR002191.sam)) -O [[DRR002191-sorted.bam]] --SORT_ORDER coordinate
    gatk MarkDuplicates -I ((DRR002191-sorted.bam)) -O [[DRR002191-markdup.bam]] -M [[DRR002191-markdup-metrics.txt]]
    gatk BaseRecalibrator --known-sites ((common_all_20180423.vcf.gz)) -I ((DRR002191-markdup.bam)) -O [[DRR002191-bqsr.txt]] -R hs37d5.fa

When the workflow is run again, shellflow runs only the added commands.

.. code:: bash

    $ shellflow run gettingstarted.sf

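The ordering rule described in this step can be sketched in plain bash
(version 4 or later, for associative arrays): remember which job wrote
each output file, and treat any later job that reads that file as
dependent on it. This is only an assumption about how such inference could
work, not shellflow's implementation; the shortened command lines and the
names ``producer`` and ``deps`` are hypothetical.

.. code:: bash

    #!/usr/bin/env bash
    # Illustration only: infer dependencies from the input/output markers.
    declare -A producer   # output file name -> number of the job that writes it
    deps=()               # collected "job N depends on job M" messages

    # Shortened, hypothetical versions of the commands in this step.
    lines=(
      'bwa mem hs37d5.fa ((DRR002191_1.fastq.bz2)) ((DRR002191_2.fastq.bz2)) > [[DRR002191.sam]]'
      'gatk SortSam -I ((DRR002191.sam)) -O [[DRR002191-sorted.bam]]'
      'gatk MarkDuplicates -I ((DRR002191-sorted.bam)) -O [[DRR002191-markdup.bam]] -M [[DRR002191-markdup-metrics.txt]]'
    )

    job=0
    for line in "${lines[@]}"; do
      job=$((job + 1))

      # An input written by an earlier job creates a dependency on that job.
      while read -r input; do
        [ -n "$input" ] || continue
        if [ -n "${producer[$input]:-}" ]; then
          deps+=("job $job depends on job ${producer[$input]}")
        fi
      done < <(printf '%s\n' "$line" | grep -oE '\(\([^()]+\)\)' \
                 | sed -e 's/^((//' -e 's/))$//')

      # Register this job as the producer of its outputs. Because only
      # earlier lines are registered, a job can never depend on a later one.
      while read -r output; do
        [ -n "$output" ] || continue
        producer[$output]=$job
      done < <(printf '%s\n' "$line" | grep -oE '\[\[[^][]+\]\]' \
                 | sed -e 's/^\[\[//' -e 's/]]$//')
    done

    printf '%s\n' "${deps[@]}"

Files that no job produces, such as the FASTQ inputs, create no
dependency in this sketch.
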
4th step: use variable
----------------------

A line that starts with ``#%`` is parsed as flowscript, the language
embedded in shellflow. A variable defined there can be referenced in
commands as ``{{NAME}}``.

.. code:: bash

    #% SAMPLE_ID = "DRR002191"
    bwa mem -R "@RG\tID:{{SAMPLE_ID}}\tSM:{{SAMPLE_ID}}\tPL:illumina\tLB:{{SAMPLE_ID}}" -t 6 hs37d5.fa <(bzip2 -dc (({{SAMPLE_ID}}_1.fastq.bz2))) <(bzip2 -dc (({{SAMPLE_ID}}_2.fastq.bz2))) > [[{{SAMPLE_ID}}.sam]]
    gatk SortSam -I (({{SAMPLE_ID}}.sam)) -O [[{{SAMPLE_ID}}-sorted.bam]] --SORT_ORDER coordinate
    gatk MarkDuplicates -I (({{SAMPLE_ID}}-sorted.bam)) -O [[{{SAMPLE_ID}}-markdup.bam]] -M [[{{SAMPLE_ID}}-markdup-metrics.txt]]
    gatk BaseRecalibrator --known-sites ((common_all_20180423.vcf.gz)) -I (({{SAMPLE_ID}}-markdup.bam)) -O [[{{SAMPLE_ID}}-bqsr.txt]] -R hs37d5.fa

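The effect of the placeholder can be imitated with bash string
replacement. The sketch below is an assumption about the resulting text
only, not shellflow's template engine; ``placeholder``, ``template`` and
``expanded`` are hypothetical names.

.. code:: bash

    #!/usr/bin/env bash
    # Illustration only: replace every {{SAMPLE_ID}} in one command line.
    SAMPLE_ID="DRR002191"
    placeholder='{{SAMPLE_ID}}'
    template='gatk SortSam -I (({{SAMPLE_ID}}.sam)) -O [[{{SAMPLE_ID}}-sorted.bam]] --SORT_ORDER coordinate'

    # ${var//pattern/replacement} replaces all matches; quoting the pattern
    # makes bash treat the braces literally.
    expanded=${template//"$placeholder"/$SAMPLE_ID}

    printf '%s\n' "$expanded"
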
5th step: use loop
------------------

The same commands can be repeated for several samples with a ``for``
loop.

.. code:: bash

    for SAMPLE_ID in DRR002191 DRR002192; do
        bwa mem -R "@RG\tID:{{SAMPLE_ID}}\tSM:{{SAMPLE_ID}}\tPL:illumina\tLB:{{SAMPLE_ID}}" -t 6 hs37d5.fa <(bzip2 -dc (({{SAMPLE_ID}}_1.fastq.bz2))) <(bzip2 -dc (({{SAMPLE_ID}}_2.fastq.bz2))) > [[{{SAMPLE_ID}}.sam]]
        gatk SortSam -I (({{SAMPLE_ID}}.sam)) -O [[{{SAMPLE_ID}}-sorted.bam]] --SORT_ORDER coordinate
        gatk MarkDuplicates -I (({{SAMPLE_ID}}-sorted.bam)) -O [[{{SAMPLE_ID}}-markdup.bam]] -M [[{{SAMPLE_ID}}-markdup-metrics.txt]]
        gatk BaseRecalibrator --known-sites ((common_all_20180423.vcf.gz)) -I (({{SAMPLE_ID}}-markdup.bam)) -O [[{{SAMPLE_ID}}-bqsr.txt]] -R hs37d5.fa
    done

The list of samples can also be kept in a flowscript variable.

.. code:: bash

    #% SAMPLES = ["DRR002191", "DRR002192"]
    for SAMPLE_ID in {{SAMPLES}}; do
        bwa mem -R "@RG\tID:{{SAMPLE_ID}}\tSM:{{SAMPLE_ID}}\tPL:illumina\tLB:{{SAMPLE_ID}}" -t 6 hs37d5.fa <(bzip2 -dc (({{SAMPLE_ID}}_1.fastq.bz2))) <(bzip2 -dc (({{SAMPLE_ID}}_2.fastq.bz2))) > [[{{SAMPLE_ID}}.sam]]
        gatk SortSam -I (({{SAMPLE_ID}}.sam)) -O [[{{SAMPLE_ID}}-sorted.bam]] --SORT_ORDER coordinate
        gatk MarkDuplicates -I (({{SAMPLE_ID}}-sorted.bam)) -O [[{{SAMPLE_ID}}-markdup.bam]] -M [[{{SAMPLE_ID}}-markdup-metrics.txt]]
        gatk BaseRecalibrator --known-sites ((common_all_20180423.vcf.gz)) -I (({{SAMPLE_ID}}-markdup.bam)) -O [[{{SAMPLE_ID}}-bqsr.txt]] -R hs37d5.fa
    done

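To see which files the loop is expected to produce, the following bash
sketch lists one output name per sample and per step, using the suffixes
from the commands in this step. The array names are made up for this
example, and the script only builds file names; it does not run bwa or
gatk.

.. code:: bash

    #!/usr/bin/env bash
    # Illustration only: enumerate the output files named in the loop body.
    SAMPLES=(DRR002191 DRR002192)
    # Suffixes taken from the [[...]] markers of the four commands.
    SUFFIXES=(.sam -sorted.bam -markdup.bam -markdup-metrics.txt -bqsr.txt)

    expected_outputs=()
    for SAMPLE_ID in "${SAMPLES[@]}"; do
      for suffix in "${SUFFIXES[@]}"; do
        expected_outputs+=("${SAMPLE_ID}${suffix}")
      done
    done

    printf '%s\n' "${expected_outputs[@]}"
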
6th step: map all FASTQ in a directory
--------------------------------------

To process every pair of FASTQ files in a directory, loop over the
``*_1.fastq.bz2`` files and derive the sample ID from each file name.

.. code:: bash

    for FILENAME in *_1.fastq.bz2; do
        #% SAMPLE_ID = basename(FILENAME, "_1.fastq.bz2")
        bwa mem -R "@RG\tID:{{SAMPLE_ID}}\tSM:{{SAMPLE_ID}}\tPL:illumina\tLB:{{SAMPLE_ID}}" -t 6 hs37d5.fa <(bzip2 -dc (({{SAMPLE_ID}}_1.fastq.bz2))) <(bzip2 -dc (({{SAMPLE_ID}}_2.fastq.bz2))) > [[{{SAMPLE_ID}}.sam]]
        gatk SortSam -I (({{SAMPLE_ID}}.sam)) -O [[{{SAMPLE_ID}}-sorted.bam]] --SORT_ORDER coordinate
        gatk MarkDuplicates -I (({{SAMPLE_ID}}-sorted.bam)) -O [[{{SAMPLE_ID}}-markdup.bam]] -M [[{{SAMPLE_ID}}-markdup-metrics.txt]]
        gatk BaseRecalibrator --known-sites ((common_all_20180423.vcf.gz)) -I (({{SAMPLE_ID}}-markdup.bam)) -O [[{{SAMPLE_ID}}-bqsr.txt]] -R hs37d5.fa
    done

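
The shell's own ``basename`` command also accepts a suffix to strip, so
the ID extraction can be tried outside shellflow. The sketch below assumes
that the flowscript ``basename`` function behaves like the shell command;
it creates empty placeholder files in a temporary directory instead of
real FASTQ data.

.. code:: bash

    #!/usr/bin/env bash
    # Illustration only: derive sample IDs from paired FASTQ file names.
    workdir=$(mktemp -d)
    touch "$workdir/DRR002191_1.fastq.bz2" "$workdir/DRR002191_2.fastq.bz2" \
          "$workdir/DRR002192_1.fastq.bz2" "$workdir/DRR002192_2.fastq.bz2"

    sample_ids=()
    for FILENAME in "$workdir"/*_1.fastq.bz2; do
      # basename drops the directory part and the given suffix.
      sample_ids+=("$(basename "$FILENAME" _1.fastq.bz2)")
    done

    printf '%s\n' "${sample_ids[@]}"
    rm -rf "$workdir"

Matching only the ``_1`` files yields each sample once, even though every
sample has two FASTQ files.
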