<!DOCTYPE html>
<html>
<head>
<link rel="stylesheet" type="text/css" href="doc.css" />
<title>Leveldb file layout and compactions</title>
</head>

<body>

<h1>Files</h1>

The implementation of leveldb is similar in spirit to the
representation of a single
<a href="http://labs.google.com/papers/bigtable.html">
Bigtable tablet (section 5.3)</a>.
However, the organization of the files that make up the representation
is somewhat different and is explained below.

<p>
Each database is represented by a set of files stored in a directory.
There are several different types of files as documented below:
<p>
<h2>Log files</h2>
<p>
A log file (*.log) stores a sequence of recent updates.  Each update
is appended to the current log file.  When the log file reaches a
pre-determined size (approximately 4MB by default), it is converted
to a sorted table (see below) and a new log file is created for future
updates.
<p>
A copy of the current log file is kept in an in-memory structure (the
<code>memtable</code>).  This copy is consulted on every read so that read
operations reflect all logged updates.
<p>
<h2>Sorted tables</h2>
<p>
A sorted table (*.sst) stores a sequence of entries sorted by key.
Each entry is either a value for the key, or a deletion marker for the
key.  (Deletion markers are kept around to hide obsolete values
present in older sorted tables.)
<p>
The set of sorted tables is organized into a sequence of levels.  The
sorted table generated from a log file is placed in a special <code>young</code>
level (also called level-0).  When the number of young files exceeds a
certain threshold (currently four), all of the young files are merged
together with all of the overlapping level-1 files to produce a
sequence of new level-1 files (we create a new level-1 file for every
2MB of data).
<p>
Files in the young level may contain overlapping keys.  However, files
in other levels have distinct non-overlapping key ranges.  Consider
level number L where L &gt;= 1.  When the combined size of files in
level-L exceeds (10^L) MB (i.e., 10MB for level-1, 100MB for level-2,
...), one file in level-L, and all of the overlapping files in
level-(L+1) are merged to form a set of new files for level-(L+1).
These merges have the effect of gradually migrating new updates from
the young level to the largest level using only bulk reads and writes
(i.e., minimizing expensive seeks).
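<p>
The size rule above can be written out as a short calculation.  The
following Python sketch is illustrative only: the names
<code>max_bytes_for_level</code> and <code>needs_compaction</code> are
hypothetical and are not taken from the leveldb source.
<pre>
```python
MB = 1024 * 1024


def max_bytes_for_level(level):
    """Size limit for level L >= 1, using the (10^L) MB rule from the text."""
    if level == 0:
        # Level-0 is triggered by file count (four files), not by total size.
        raise ValueError("level-0 has no size limit in this sketch")
    return (10 ** level) * MB


def needs_compaction(level, file_sizes):
    """True when the combined size of the level's files exceeds its limit."""
    return sum(file_sizes) > max_bytes_for_level(level)
```
</pre>
For example, five 2MB files at level-1 total exactly 10MB and do not
trigger a compaction in this sketch, while a sixth file does.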

<h2>Manifest</h2>
<p>
A MANIFEST file lists the set of sorted tables that make up each
level, the corresponding key ranges, and other important metadata.
A new MANIFEST file (with a new number embedded in the file name)
is created whenever the database is reopened.  The MANIFEST file is
formatted as a log, and changes made to the serving state (as files
are added or removed) are appended to this log.
<p>
<h2>Current</h2>
<p>
CURRENT is a simple text file that contains the name of the latest
MANIFEST file.
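<p>
Since CURRENT only holds a file name, locating the latest MANIFEST
takes one small read.  A minimal sketch, assuming the name is stored as
plain text relative to the database directory; the helper name
<code>current_manifest_path</code> is hypothetical:
<pre>
```python
import os


def current_manifest_path(db_dir):
    """Return the path of the MANIFEST file named by CURRENT in db_dir."""
    with open(os.path.join(db_dir, "CURRENT"), "r") as f:
        # Assumption: the file holds just the name, possibly with a newline.
        name = f.read().strip()
    if not name:
        raise ValueError("CURRENT is empty")
    return os.path.join(db_dir, name)
```
</pre>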
<p>
<h2>Info logs</h2>
<p>
Informational messages are printed to files named LOG and LOG.old.
<p>
<h2>Others</h2>
<p>
Other files used for miscellaneous purposes may also be present
(LOCK, *.dbtmp).

<h1>Level 0</h1>
When the log file grows above a certain size (approximately 4MB by default):
<ul>
<li>Create a brand new memtable and log file and direct future updates to them
<li>In the background:
<ul>
<li>Write the contents of the previous memtable to an sstable
<li>Discard the previous memtable
<li>Delete the old log file
<li>Add the new sstable to the young (level-0) level.
</ul>
</ul>
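<p>
The steps above can be sketched with a toy, single-threaded model.
Everything here is an assumption made for illustration: the class
<code>ToyDB</code> is hypothetical, the log and the sstables are plain
in-memory lists rather than files, the log size is approximated by key
plus value length, and the background work runs inline.
<pre>
```python
class ToyDB:
    """In-memory toy model of the log, the memtable and level-0."""

    def __init__(self, log_limit_bytes):
        self.log_limit_bytes = log_limit_bytes
        self.memtable = {}
        self.log = []        # appended (key, value) records
        self.log_bytes = 0
        self.level0 = []     # each sstable is a list of (key, value) sorted by key

    def put(self, key, value):
        self.log.append((key, value))
        self.log_bytes += len(key) + len(value)
        self.memtable[key] = value
        if self.log_bytes > self.log_limit_bytes:
            self._switch()

    def _switch(self):
        old_memtable, old_log = self.memtable, self.log
        # New memtable and log take over future updates.
        self.memtable, self.log, self.log_bytes = {}, [], 0
        # "Background" part, done inline in this sketch.
        sstable = sorted(old_memtable.items())  # write the memtable out sorted
        old_log.clear()                         # old log is no longer needed
        self.level0.append(sstable)             # new sstable joins level-0
```
</pre>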

<h1>Compactions</h1>

<p>
When the size of level L exceeds its limit, we compact it in a
background thread.  The compaction picks a file from level L and all
overlapping files from the next level L+1.  Note that if a level-L
file overlaps only part of a level-(L+1) file, the entire file at
level-(L+1) is used as an input to the compaction and will be
discarded after the compaction.  Aside: because level-0 is special
(files in it may overlap each other), we treat compactions from
level-0 to level-1 specially: a level-0 compaction may pick more than
one level-0 file in case some of these files overlap each other.
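<p>
Input selection comes down to key-range overlap tests.  The sketch
below is one possible reading of the paragraph above, not the leveldb
code: <code>FileMeta</code> and the function names are hypothetical, key
ranges are treated as inclusive, and the level-0 case is assumed to
keep growing the range until no further level-0 file overlaps it.
<pre>
```python
from collections import namedtuple

FileMeta = namedtuple("FileMeta", ["name", "smallest", "largest"])


def overlapping_files(files, smallest, largest):
    """Files whose [smallest, largest] key range intersects the given range."""
    return [f for f in files if largest >= f.smallest and f.largest >= smallest]


def pick_next_level_inputs(picked, next_level_files):
    """Every level-(L+1) file touched by the picked file is used whole."""
    return overlapping_files(next_level_files, picked.smallest, picked.largest)


def pick_level0_inputs(seed, level0_files):
    """Grow the input set until no further level-0 file overlaps its range.

    The seed must be one of level0_files.
    """
    smallest, largest = seed.smallest, seed.largest
    while True:
        picked = overlapping_files(level0_files, smallest, largest)
        grown = (min(f.smallest for f in picked), max(f.largest for f in picked))
        if grown == (smallest, largest):
            return picked
        smallest, largest = grown
```
</pre>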

<p>
A compaction merges the contents of the picked files to produce a
sequence of level-(L+1) files.  We switch to producing a new
level-(L+1) file after the current output file has reached the target
file size (2MB).  We also switch to a new output file when the key
range of the current output file has grown enough to overlap more than
ten level-(L+2) files.  This last rule ensures that a later compaction
of a level-(L+1) file will not pick up too much data from level-(L+2).
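<p>
The two rules for switching output files can be sketched as follows.
This is a simplified illustration under stated assumptions: entries are
given as (key, size in bytes) pairs already in sorted order, level-(L+2)
files are given as inclusive (smallest, largest) key ranges, and the
names <code>split_outputs</code> and <code>count_overlaps</code> are
hypothetical.  The sketch starts a new file just before the current one
would overlap more than ten level-(L+2) files.
<pre>
```python
TARGET_FILE_SIZE = 2 * 1024 * 1024   # 2MB target from the text
MAX_GRANDPARENT_FILES = 10           # "ten level-(L+2) files" from the text


def count_overlaps(ranges, smallest, largest):
    """Number of (lo, hi) ranges that intersect [smallest, largest]."""
    return sum(1 for lo, hi in ranges if largest >= lo and hi >= smallest)


def split_outputs(entries, grandparents):
    """Group sorted (key, nbytes) entries into output files (lists of keys)."""
    outputs, current, size = [], [], 0
    for key, nbytes in entries:
        # Rule 2: adding this key would make the file span too many
        # level-(L+2) files, so close the current file first.
        if current and count_overlaps(grandparents, current[0], key) > MAX_GRANDPARENT_FILES:
            outputs.append(current)
            current, size = [], 0
        current.append(key)
        size += nbytes
        # Rule 1: the current file has reached the target size.
        if size >= TARGET_FILE_SIZE:
            outputs.append(current)
            current, size = [], 0
    if current:
        outputs.append(current)
    return outputs
```
</pre>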

<p>
The old files are discarded and the new files are added to the serving
state.

<p>
Compactions for a particular level rotate through the key space.  In
more detail, for each level L, we remember the ending key of the last
compaction at level L.  The next compaction for level L will pick the
first file that starts after this key (wrapping around to the
beginning of the key space if there is no such file).
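<p>
The rotation rule is small enough to show directly.  In this sketch the
function name <code>pick_next_file</code> is hypothetical, files are
(smallest, largest) key pairs sorted by smallest key, and
<code>None</code> stands for "no compaction has happened at this level
yet".
<pre>
```python
def pick_next_file(files, last_end_key):
    """Pick the first file that starts after last_end_key, wrapping around."""
    if last_end_key is not None:
        for f in files:
            if f[0] > last_end_key:
                return f
    # No remembered key, or no file starts after it: wrap to the beginning.
    return files[0]
```
</pre>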

<p>
Compactions drop overwritten values.  They also drop deletion markers
if there are no higher-numbered levels that contain a file whose range
overlaps the current key.
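<p>
A toy version of these two rules is sketched below.  The names
<code>TOMBSTONE</code>, <code>can_drop_deletion</code> and
<code>compact</code> are hypothetical, "higher-numbered" is assumed to
mean levels above the compaction's output level, and the merge keeps
only the newest entry per key without modelling anything else a real
implementation must track.
<pre>
```python
TOMBSTONE = object()  # stands in for a deletion marker


def can_drop_deletion(key, output_level, levels):
    """True if no file in a level above output_level covers key.

    levels maps a level number to a list of inclusive (smallest, largest) ranges.
    """
    for level, files in levels.items():
        if level > output_level:
            for lo, hi in files:
                if key >= lo and hi >= key:
                    return False
    return True


def compact(inputs_newest_first, can_drop):
    """Merge tables of (key, value); newest value wins, droppable markers go."""
    merged = {}
    for table in inputs_newest_first:
        for key, value in table:
            merged.setdefault(key, value)  # the first value seen is the newest
    return [(k, v) for k, v in sorted(merged.items())
            if not (v is TOMBSTONE and can_drop(k))]
```
</pre>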

<h2>Timing</h2>

Level-0 compactions will read up to four 1MB files from level-0, and
at worst all the level-1 files (10MB).  I.e., we will read 14MB and
write 14MB.

<p>
Other than the special level-0 compactions, we will pick one 2MB file
from level L.  In the worst case, this will overlap ~12 files from
level-(L+1) (10 because level-(L+1) is ten times the size of level-L,
and another two at the boundaries since the file ranges at level-L
will usually not be aligned with the file ranges at level-(L+1)).  The
compaction will therefore read 26MB and write 26MB.  Assuming a disk
IO rate of 100MB/s (ballpark range for modern drives), the worst
compaction cost will be approximately 0.5 seconds.

<p>
If we throttle the background writing to something small, say 10% of
the full 100MB/s speed, a compaction may take up to 5 seconds.  If the
user is writing at 10MB/s, we might build up lots of level-0 files
(~50 to hold the 5*10MB).  This may significantly increase the cost of
reads due to the overhead of merging more files together on every
read.
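<p>
The estimates above follow from simple arithmetic, reproduced here as a
check.  The variable and function names are illustrative; the inputs
(2MB files, 12 overlapping files, 100MB/s disk, 10% throttle, 10MB/s
user writes, 1MB level-0 files) are the figures used in the text.
<pre>
```python
def compaction_seconds(read_mb, write_mb, disk_mb_per_s):
    """Time to read and then write the given volumes at a fixed IO rate."""
    return (read_mb + write_mb) / disk_mb_per_s


FILE_MB = 2             # target file size
OVERLAPPING_FILES = 12  # 10 from the size ratio plus 2 at the boundaries

io_mb = FILE_MB + OVERLAPPING_FILES * FILE_MB        # 26MB read, 26MB written
full_speed = compaction_seconds(io_mb, io_mb, 100)   # unthrottled disk
throttled = compaction_seconds(io_mb, io_mb, 10)     # 10% of 100MB/s

# Level-0 files (1MB each) accumulated while one throttled compaction
# runs, with the user writing at 10MB/s.
level0_backlog = throttled * 10 / 1
```
</pre>
The exact values are 0.52 seconds, 5.2 seconds and 52 files, which the
text rounds to 0.5 seconds, 5 seconds and ~50 files.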

<p>
Solution 1: To reduce this problem, we might want to increase the log
switching threshold when the number of level-0 files is large.  The
downside is that the larger this threshold, the more memory we will
need to hold the corresponding memtable.

<p>
Solution 2: We might want to decrease the write rate artificially when
the number of level-0 files goes up.

<p>
Solution 3: We work on reducing the cost of very wide merges.
Perhaps most of the level-0 files will have their blocks sitting
uncompressed in the cache and we will only need to worry about the
O(N) complexity in the merging iterator.

<h2>Number of files</h2>

Instead of always making 2MB files, we could make larger files for
larger levels to reduce the total file count, though at the expense of
more bursty compactions.  Alternatively, we could shard the set of
files into multiple directories.

<p>
An experiment on an <code>ext3</code> filesystem on Feb 04, 2011 shows
the following timings to do 100K file opens in directories with
varying number of files:
<table class="datatable">
<tr><th>Files in directory</th><th>Microseconds to open a file</th></tr>
<tr><td>1000</td><td>9</td></tr>
<tr><td>10000</td><td>10</td></tr>
<tr><td>100000</td><td>16</td></tr>
</table>
So maybe even the sharding is not necessary on modern filesystems?

<h1>Recovery</h1>

<ul>
<li> Read CURRENT to find the name of the latest committed MANIFEST
<li> Read the named MANIFEST file
<li> Clean up stale files
<li> We could open all sstables here, but it is probably better to be lazy...
<li> Convert log chunk to a new level-0 sstable
<li> Start directing new writes to a new log file with the recovered sequence number
</ul>

<h1>Garbage collection of files</h1>

<code>DeleteObsoleteFiles()</code> is called at the end of every
compaction and at the end of recovery.  It finds the names of all
files in the database.  It deletes all log files that are not the
current log file.  It deletes all table files that are not referenced
from some level and are not the output of an active compaction.
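<p>
The selection rule described above can be sketched as a pure function
over file names.  This is not the body of
<code>DeleteObsoleteFiles()</code>: the function
<code>obsolete_files</code> and its parameters are hypothetical, files
are classified only by the <code>.log</code> and <code>.sst</code>
suffixes mentioned earlier, and every other file is left alone.
<pre>
```python
def obsolete_files(all_names, current_log, live_tables, pending_outputs):
    """Return the names that the rule in the text would delete.

    live_tables: table files referenced from some level.
    pending_outputs: table files being written by an active compaction.
    """
    doomed = []
    for name in all_names:
        if name.endswith(".log"):
            keep = name == current_log
        elif name.endswith(".sst"):
            keep = name in live_tables or name in pending_outputs
        else:
            keep = True  # other file types are not handled in this sketch
        if not keep:
            doomed.append(name)
    return doomed
```
</pre>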

</body>
</html>