/*
bíogo is a bioinformatics library for the Go language. It is a work in progress.

The Purpose of bíogo

bíogo stems from the need to address the size and structure of modern
genomic and metagenomic data sets. These properties impose requirements on the
libraries and languages used for analysis:

	• speed - data sets are large
	• concurrency - problems are often embarrassingly parallelisable

In addition to the computational burden of massive data sets in modern
genomics, there is an increasing need for complex pipelines to resolve questions
in a tightening problem space, and a growing need to develop new algorithms
that allow novel approaches to interesting questions. These issues suggest the
need for simplicity in syntax to facilitate:

	• ease of coding
	• checking for correctness in development and particularly in peer review

Related to the second issue is the reluctance of some researchers to release
code because of quality concerns:
http://www.nature.com/news/2010/101013/full/467753a.html

The issue of code release is the first of the principles formalised in the
Science Code Manifesto http://sciencecodemanifesto.org/

 Code	All source code written specifically to process data for a published
	paper must be available to the reviewers and readers of the paper.

A language with a simple, yet expressive, syntax should facilitate development
of higher quality code and thus help reduce this barrier to research code
release.

Yet Another Bioinformatics Library

It seems that nearly every language has its own bioinformatics library, some of
which are very mature, for example BioPerl and BioPython. Why add another one?

The different libraries excel in different fields: acting as scripting glue for
applications in a pipeline (much of [1-3]), interacting with external hosts
[1, 2, 4, 5], wrapping lower level high performance languages with more user
friendly syntax [1-4], or providing bioinformatics functions for high
performance languages [5, 6].

The intended niche for bíogo lies somewhere between the scripting libraries
and high performance language libraries in being easy to use for both small and
large projects while having reasonable performance with computationally
intensive tasks.

The intent is to reduce the level of investment required to develop new
research software for computationally intensive tasks.

 [1] BioPerl http://bioperl.org/
	http://genome.cshlp.org/content/12/10/1611.full
	http://www.springerlink.com/content/pp72033m171568p2

 [2] BioPython http://biopython.org/
	http://bioinformatics.oxfordjournals.org/content/25/11/1422

 [3] BioRuby http://bioruby.org/
	http://bioinformatics.oxfordjournals.org/content/26/20/2617

 [4] PyCogent http://pycogent.sourceforge.net/
	http://genomebiology.com/2007/8/8/R171

 [5] BioJava http://biojava.org/
	http://bioinformatics.oxfordjournals.org/content/24/18/2096

 [6] SeqAn http://www.seqan.de/
	http://www.biomedcentral.com/1471-2105/9/11

Library Structure and Coding Style

The bíogo library structure is influenced both by the structure of BioPerl and
the Go core libraries.

The coding style should be aligned with normal Go idioms as represented in the
Go core libraries.

Position Numbering

Position numbering in the bíogo library conforms to the zero-based indexing
of Go, and range indexing conforms to Go's half-open zero-based slice indexing.
This is at odds with the 'normal' one-based inclusive indexing used by molecular
biologists. The choice was made to avoid mixing inconsistent indexing spaces
(one-based inclusive for bíogo functions and methods, zero-based for native Go
slices and arrays) and so avoid the errors that this would otherwise
facilitate. Note that the GFF package does allow, and defaults to, one-based
inclusive indexing in its input and output of GFF files.
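
As an illustration of the two indexing spaces, the following sketch converts a
one-based inclusive range, as found in a GFF file, to Go's zero-based half-open
form and back. The Interval type and its functions are hypothetical names chosen
for this example and are not part of the bíogo API.

```go
package main

import "fmt"

// Interval is a hypothetical zero-based, half-open range [Start, End),
// matching Go slice indexing. It is not a bíogo type.
type Interval struct {
	Start, End int
}

// FromOneBased converts a one-based inclusive range, as written in a GFF
// file, to a zero-based half-open Interval.
func FromOneBased(first, last int) Interval {
	return Interval{Start: first - 1, End: last}
}

// ToOneBased converts back to one-based inclusive coordinates.
func (iv Interval) ToOneBased() (first, last int) {
	return iv.Start + 1, iv.End
}

// Len is the difference between the bounds, with no off-by-one correction.
func (iv Interval) Len() int { return iv.End - iv.Start }

func main() {
	seq := "ACGTACGTAC"
	iv := FromOneBased(3, 6) // bases 3 to 6 in one-based inclusive terms

	fmt.Println(iv.Start, iv.End, iv.Len()) // 2 6 4
	fmt.Println(seq[iv.Start:iv.End])       // GTAC
	fmt.Println(iv.ToOneBased())            // 3 6
}
```

Because the Interval bounds are already in slice form, they can be used to
index a Go string or slice directly, and the length needs no adjustment.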

	EWD831 Why numbering should start at zero

	To denote the subsequence of natural numbers 2, 3, ..., 12 without the
	pernicious three dots, four conventions are open to us

	a) 2 ≤ i < 13
	b) 1 < i ≤ 12
	c) 2 ≤ i ≤ 12
	d) 1 < i < 13

	Are there reasons to prefer one convention to the other? Yes, there are.
	The observation that conventions a) and b) have the advantage that the
	difference between the bounds as mentioned equals the length of the
	subsequence is valid. So is the observation that, as a consequence, in
	either convention two subsequences are adjacent means that the upper
	bound of the one equals the lower bound of the other. Valid as these
	observations are, they don't enable us to choose between a) and b); so
	let us start afresh.

	There is a smallest natural number. Exclusion of the lower bound —as in
	b) and d)— forces for a subsequence starting at the smallest natural
	number the lower bound as mentioned into the realm of the unnatural
	numbers. That is ugly, so for the lower bound we prefer the ≤ as in a)
	and c). Consider now the subsequences starting at the smallest natural
	number: inclusion of the upper bound would then force the latter to be
	unnatural by the time the sequence has shrunk to the empty one. That is
	ugly, so for the upper bound we prefer < as in a) and d). We conclude
	that convention a) is to be preferred.

	Remark  The programming language Mesa, developed at Xerox PARC, has
	special notations for intervals of integers in all four conventions.
	Extensive experience with Mesa has shown that the use of the other three
	conventions has been a constant source of clumsiness and mistakes, and
	on account of that experience Mesa programmers are now strongly advised
	not to use the latter three available features. I mention this
	experimental evidence —for what it is worth— because some people feel
	uncomfortable with conclusions that have not been confirmed in practice.
	(End of Remark.)

				*                *
					*

	When dealing with a sequence of length N, the elements of which we wish
	to distinguish by subscript, the next vexing question is what subscript
	value to assign to its starting element. Adhering to convention a)
	yields, when starting with subscript 1, the subscript range 1 ≤ i <
	N+1; starting with 0, however, gives the nicer range 0 ≤ i < N. So
	let us let our ordinals start at zero: an element's ordinal (subscript)
	equals the number of elements preceding it in the sequence. And the
	moral of the story is that we had better regard —after all those
	centuries!— zero as a most natural number.

	Remark  Many programming languages have been designed without due
	attention to this detail. In FORTRAN subscripts always start at 1; in
	ALGOL 60 and in PASCAL, convention c) has been adopted; the more recent
	SASL has fallen back on the FORTRAN convention: a sequence in SASL is at
	the same time a function on the positive integers. Pity! (End of
	Remark.)

				*                *
					*

	The above has been triggered by a recent incident, when, in an emotional
	outburst, one of my mathematical colleagues at the University —not a
	computing scientist— accused a number of younger computing scientists of
	"pedantry" because —as they do by habit— they started numbering at zero.
	He took consciously adopting the most sensible convention as a
	provocation. (Also the "End of ..." convention is viewed of as
	provocative; but the convention is useful: I know of a student who
	almost failed at an examination by the tacit assumption that the
	questions ended at the bottom of the first page.) I think Antony Jay is
	right when he states: "In corporate religions as in others, the heretic
	must be cast out not because of the probability that he is wrong but
	because of the possibility that he is right."


	Plataanstraat 5		11 August 1982
	5671 AL NUENEN		prof.dr. Edsger W. Dijkstra
	The Netherlands		Burroughs Research Fellow


Quality Scores

Quality scores are supported for all sequence types, including protein. Phred
and Solexa scores can both be read from files. However, quality scores are
represented internally as Phred scores, so converting from Solexa loses
precision. A Solexa quality score type is provided for use where this would
be a problem.
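
The precision loss can be illustrated with the commonly used conversion
formulas between the two scales. The sketch below assumes that scores are
stored as integers, so the rounding step is where information is lost. The
function names are hypothetical and are not the bíogo API.

```go
package main

import (
	"fmt"
	"math"
)

// SolexaToPhred converts a Solexa-scaled score to the Phred scale.
// Solexa: Q = -10 log10(p/(1-p)); Phred: Q = -10 log10(p).
func SolexaToPhred(q float64) float64 {
	return 10 * math.Log10(math.Pow(10, q/10)+1)
}

// PhredToSolexa is the inverse conversion. It is only finite for q > 0.
func PhredToSolexa(q float64) float64 {
	return 10 * math.Log10(math.Pow(10, q/10)-1)
}

// SolexaToPhredInt rounds the converted score to an integer, which is
// the storage form assumed in this sketch.
func SolexaToPhredInt(q int) int {
	return int(math.Round(SolexaToPhred(float64(q))))
}

func main() {
	for _, s := range []int{-5, -4, 10, 40} {
		fmt.Printf("Solexa %d -> Phred %d\n", s, SolexaToPhredInt(s))
	}
}
```

Under these assumptions the Solexa scores -5 and -4 both round to Phred 1, so
the original value cannot be recovered from the stored score.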

Copyright ©2011-2012 The bíogo Authors except where otherwise noted. All rights
reserved. Use of this source code is governed by a BSD-style license that can be
found in the LICENSE file.
*/
package biogo