/*
bíogo is a bioinformatics library for the Go language. It is a work in progress.

The Purpose of bíogo

bíogo stems from the need to address the size and structure of modern
genomic and metagenomic data sets. These properties impose requirements on the
libraries and languages used for analysis:

    • speed - size of data sets
    • concurrency - problems often embarrassingly parallelisable

In addition to the computational burden of massive data sets in modern
genomics, there is an increasing need for complex pipelines to resolve questions
in a tightening problem space, and a growing need to develop new algorithms
that allow novel approaches to interesting questions. These issues suggest the
need for simplicity of syntax to facilitate:

    • ease of coding
    • checking for correctness in development and particularly in peer review

Related to the second issue is the reluctance of some researchers to release
code because of quality concerns
http://www.nature.com/news/2010/101013/full/467753a.html

The issue of code release is the first of the principles formalised in the
Science Code Manifesto http://sciencecodemanifesto.org/

    Code  All source code written specifically to process data for a published
          paper must be available to the reviewers and readers of the paper.

A language with a simple, yet expressive, syntax should facilitate development
of higher quality code and thus help reduce this barrier to research code
release.

Yet Another Bioinformatics Library

It seems that nearly every language has its own bioinformatics library, some of
which are very mature, for example BioPerl and BioPython. Why add another one?

The different libraries excel in different fields: acting as scripting glue for
applications in a pipeline (much of [1-3]) and interacting with external hosts
[1, 2, 4, 5], wrapping lower-level, high-performance languages with more
user-friendly syntax [1-4], or providing bioinformatics functions for
high-performance languages [5, 6].

The intended niche for bíogo lies somewhere between the scripting libraries
and high-performance language libraries in being easy to use for both small and
large projects while having reasonable performance with computationally
intensive tasks.

The intent is to reduce the level of investment required to develop new
research software for computationally intensive tasks.

    [1] BioPerl http://bioperl.org/
        http://genome.cshlp.org/content/12/10/1611.full
        http://www.springerlink.com/content/pp72033m171568p2

    [2] BioPython http://biopython.org/
        http://bioinformatics.oxfordjournals.org/content/25/11/1422

    [3] BioRuby http://bioruby.org/
        http://bioinformatics.oxfordjournals.org/content/26/20/2617

    [4] PyCogent http://pycogent.sourceforge.net/
        http://genomebiology.com/2007/8/8/R171

    [5] BioJava http://biojava.org/
        http://bioinformatics.oxfordjournals.org/content/24/18/2096

    [6] SeqAn http://www.seqan.de/
        http://www.biomedcentral.com/1471-2105/9/11

Library Structure and Coding Style

The bíogo library structure is influenced both by the structure of BioPerl and
the Go core libraries.

The coding style should be aligned with normal Go idioms as represented in the
Go core libraries.

Position Numbering

Position numbering in the bíogo library conforms to the zero-based indexing
of Go, and range indexing conforms to Go's half-open zero-based slice indexing.
This is at odds with the 'normal' inclusive indexing used by molecular
biologists.
This choice was made to avoid inconsistent indexing spaces being
used (one-based inclusive for bíogo functions and methods and zero-based for
native Go slices and arrays) and so avoid errors that this would otherwise
facilitate. Note that the GFF package does allow, and defaults to, one-based
inclusive indexing in its input and output of GFF files.

    EWD831 Why numbering should start at zero

    To denote the subsequence of natural numbers 2, 3, ..., 12 without the
    pernicious three dots, four conventions are open to us

        a) 2 ≤ i < 13
        b) 1 < i ≤ 12
        c) 2 ≤ i ≤ 12
        d) 1 < i < 13

    Are there reasons to prefer one convention to the other? Yes, there are.
    The observation that conventions a) and b) have the advantage that the
    difference between the bounds as mentioned equals the length of the
    subsequence is valid. So is the observation that, as a consequence, in
    either convention two subsequences are adjacent means that the upper
    bound of the one equals the lower bound of the other. Valid as these
    observations are, they don't enable us to choose between a) and b); so
    let us start afresh.

    There is a smallest natural number. Exclusion of the lower bound —as in
    b) and d)— forces for a subsequence starting at the smallest natural
    number the lower bound as mentioned into the realm of the unnatural
    numbers. That is ugly, so for the lower bound we prefer the ≤ as in a)
    and c). Consider now the subsequences starting at the smallest natural
    number: inclusion of the upper bound would then force the latter to be
    unnatural by the time the sequence has shrunk to the empty one. That is
    ugly, so for the upper bound we prefer < as in a) and d). We conclude
    that convention a) is to be preferred.

    Remark The programming language Mesa, developed at Xerox PARC, has
    special notations for intervals of integers in all four conventions.
    Extensive experience with Mesa has shown that the use of the other three
    conventions has been a constant source of clumsiness and mistakes, and
    on account of that experience Mesa programmers are now strongly advised
    not to use the latter three available features. I mention this
    experimental evidence —for what it is worth— because some people feel
    uncomfortable with conclusions that have not been confirmed in practice.
    (End of Remark.)

                                  *     *
                                     *

    When dealing with a sequence of length N, the elements of which we wish
    to distinguish by subscript, the next vexing question is what subscript
    value to assign to its starting element. Adhering to convention a)
    yields, when starting with subscript 1, the subscript range 1 ≤ i <
    N+1; starting with 0, however, gives the nicer range 0 ≤ i < N. So
    let us let our ordinals start at zero: an element's ordinal (subscript)
    equals the number of elements preceding it in the sequence. And the
    moral of the story is that we had better regard —after all those
    centuries!— zero as a most natural number.

    Remark Many programming languages have been designed without due
    attention to this detail. In FORTRAN subscripts always start at 1; in
    ALGOL 60 and in PASCAL, convention c) has been adopted; the more recent
    SASL has fallen back on the FORTRAN convention: a sequence in SASL is at
    the same time a function on the positive integers. Pity! (End of
    Remark.)

                                  *     *
                                     *

    The above has been triggered by a recent incident, when, in an emotional
    outburst, one of my mathematical colleagues at the University —not a
    computing scientist— accused a number of younger computing scientists of
    "pedantry" because —as they do by habit— they started numbering at zero.
    He took consciously adopting the most sensible convention as a
    provocation. (Also the "End of ..."
    convention is viewed of as
    provocative; but the convention is useful: I know of a student who
    almost failed at an examination by the tacit assumption that the
    questions ended at the bottom of the first page.) I think Antony Jay is
    right when he states: "In corporate religions as in others, the heretic
    must be cast out not because of the probability that he is wrong but
    because of the possibility that he is right."


    Plataanstraat 5                         11 August 1982
    5671 AL NUENEN                          prof.dr. Edsger W. Dijkstra
    The Netherlands                         Burroughs Research Fellow


Quality Scores

Quality scores are supported for all sequence types, including protein. Both
Phred and Solexa scores can be read from files; however, quality scores are
represented internally as Phred, so converting Solexa scores loses precision.
A Solexa quality score type is provided for use where this would be a problem.

Copyright ©2011-2012 The bíogo Authors except where otherwise noted. All rights
reserved. Use of this source code is governed by a BSD-style license that can be
found in the LICENSE file.
*/
package biogo