github.com/balzaczyy/golucene@v0.0.0-20151210033525-d0be9ee89713/core/codec/lucene41/storedFieldsFormat.go

package lucene41

import (
	"github.com/balzaczyy/golucene/core/codec/compressing"
)

// lucene41/Lucene41StoredFieldsFormat.java

/*
Lucene 4.1 stored fields format.

Principle

This StoredFieldsFormat compresses blocks of 16KB of documents in
order to improve the compression ratio compared to document-level
compression. It uses the LZ4 compression algorithm, which is fast to
compress and very fast to decompress data. Although the compression
method that is used focuses more on speed than on compression ratio,
it should provide interesting compression ratios for redundant inputs
(such as log files, HTML or plain text).

File formats

Stored fields are represented by two files:

1. field_data

A fields data file (extension .fdt). This file stores a compact
representation of documents in compressed blocks of 16KB or more.
When writing a segment, documents are appended to an in-memory []byte
buffer. When its size reaches 16KB or more, some metadata about the
documents is flushed to disk, immediately followed by a compressed
representation of the buffer using the [LZ4](http://code.google.com/p/lz4/)
[compression format](http://fastcompression.blogspot.ru/2011/05/lz4-explained.html).
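
A minimal sketch of that flush policy follows. It is not the actual golucene
writer (the real buffering lives in the compressing package) and the names
are made up for illustration; it only shows documents accumulating in an
in-memory buffer and a chunk being flushed once at least 16KB is pending:

	package main

	import "fmt"

	// chunkSize mirrors the 16KB threshold used by this format (1<<14 bytes).
	const chunkSize = 1 << 14

	// chunkBuffer is a toy stand-in for the writer's in-memory buffer.
	type chunkBuffer struct {
		docs [][]byte // pending, uncompressed documents
		size int      // total bytes buffered so far
	}

	func (b *chunkBuffer) addDoc(doc []byte) {
		b.docs = append(b.docs, doc)
		b.size += len(doc)
		if b.size >= chunkSize {
			b.flush()
		}
	}

	func (b *chunkBuffer) flush() {
		if len(b.docs) == 0 {
			return
		}
		// A real writer would emit DocBase, ChunkDocs, DocFieldCounts and
		// DocLengths here, followed by the LZ4-compressed concatenation
		// of the buffered documents.
		fmt.Printf("flush: %d docs, %d bytes\n", len(b.docs), b.size)
		b.docs, b.size = nil, 0
	}

	func main() {
		var buf chunkBuffer
		for i := 0; i < 100; i++ {
			buf.addDoc(make([]byte, 300)) // ~300-byte documents
		}
		buf.flush() // flush whatever is left at the end of the segment
	}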

Here is a more detailed description of the field data file format:

- FieldData (.fdt) --> <Header>, PackedIntsVersion, <Chunk>^ChunkCount
- Header --> CodecHeader
- PackedIntsVersion --> PackedInts.VERSION_CURRENT as a VInt
- ChunkCount is not known in advance and is the number of chunks
necessary to store all documents of the segment
- Chunk --> DocBase, ChunkDocs, DocFieldCounts, DocLengths, <CompressedDocs>
- DocBase --> the ID of the first document of the chunk as a VInt
- ChunkDocs --> the number of documents in the chunk as a VInt
- DocFieldCounts --> the number of stored fields of every document
in the chunk, encoded as follows (see the sketch after this list):
  - if ChunkDocs=1, the unique value is encoded as a VInt
  - else read a VInt (let's call it bitsRequired)
    - if bitsRequired is 0 then all values are equal, and the common
    value is the following VInt
    - else bitsRequired is the number of bits required to store any
    value, and values are stored in a packed array where every value
    is stored on exactly bitsRequired bits
- DocLengths --> the lengths of all documents in the chunk, encoded
with the same method as DocFieldCounts
- CompressedDocs --> a compressed representation of <Docs> using
the LZ4 compression format
- Docs --> <Doc>^ChunkDocs
- Doc --> <FieldNumAndType, Value>^DocFieldCount
- FieldNumAndType --> a VLong, whose 3 last bits are Type and other
bits are FieldNum (see the sketch after this list)
- Type -->
  - 0: Value is string
  - 1: Value is BinaryValue
  - 2: Value is int
  - 3: Value is float32
  - 4: Value is int64
  - 5: Value is float64
  - 6, 7: unused
- FieldNum --> an ID of the field
- Value --> string | BinaryValue | int | float32 | int64 | float64
depending on Type
- BinaryValue --> ValueLength <Byte>^ValueLength
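
The DocFieldCounts/DocLengths encoding and the FieldNumAndType packing
referenced above can be illustrated with a small standalone sketch. This is
not the golucene implementation (the real code uses the packed ints
utilities); the helpers below are hypothetical and only mirror the decision
tree and the bit layout:

	package main

	import (
		"fmt"
		"math/bits"
	)

	// packFieldNumAndType puts the field number in the high bits and the
	// value type in the 3 last (lowest) bits, as FieldNumAndType does.
	func packFieldNumAndType(fieldNum int, valueType uint8) uint64 {
		return uint64(fieldNum)<<3 | uint64(valueType&0x7)
	}

	// describeCountsEncoding mirrors the decision tree used for
	// DocFieldCounts and DocLengths: one document, all-equal values, or a
	// packed array with just enough bits per value.
	func describeCountsEncoding(values []uint32) string {
		if len(values) == 1 {
			return "single value as a VInt"
		}
		maxVal, allEqual := values[0], true
		for _, v := range values[1:] {
			if v != values[0] {
				allEqual = false
			}
			if v > maxVal {
				maxVal = v
			}
		}
		if allEqual {
			return "bitsRequired=0, then the common value as a VInt"
		}
		return fmt.Sprintf("packed array, %d bits per value", bits.Len32(maxVal))
	}

	func main() {
		fmt.Printf("0x%x\n", packFieldNumAndType(5, 2)) // field 5 holding an int value
		fmt.Println(describeCountsEncoding([]uint32{3, 3, 3}))
		fmt.Println(describeCountsEncoding([]uint32{1, 7, 4}))
	}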

Notes

- If documents are larger than 16KB then chunks will likely contain
only one document. However, documents can never spread across several
chunks (all fields of a single document are in the same chunk).
- When at least one document in a chunk is large enough so that the
chunk is larger than 32KB, the chunk will actually be compressed into
several LZ4 blocks of 16KB. This allows StoredFieldsVisitors which
are only interested in the first fields of a document to not have to
decompress 10MB of data if the document is 10MB, but only 16KB.
- Given that the original lengths are written in the metadata of the
chunk, the decompressor can leverage this information to stop decoding
as soon as enough data has been decompressed (see the sketch after
this list).
- In case documents are incompressible, CompressedDocs will be less
than 0.5% larger than Docs.
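
As a sketch of that early-stop point: once DocLengths has been decoded from
the chunk metadata, the position of a document inside the decompressed chunk
is a simple prefix sum, so the decompressor knows exactly how many bytes it
must produce before it can stop. A hypothetical standalone helper (not the
golucene decompressor):

	// docSlice returns the offset and length of document i inside the
	// decompressed chunk, given the DocLengths read from the chunk
	// metadata; decompression can stop once offset+length bytes exist.
	func docSlice(docLengths []int, i int) (offset, length int) {
		for _, l := range docLengths[:i] {
			offset += l
		}
		return offset, docLengths[i]
	}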

2. field_index

A fields index file (extension .fdx).

- FieldsIndex (.fdx) --> <Header>, <ChunkIndex>
- Header --> CodecHeader
- ChunkIndex: See CompressingStoredFieldsIndexWriter

Known limitations

This StoredFieldsFormat does not support individual documents larger
than (2^32 - 2^14) bytes. In case this is a problem, you should use
another format, such as Lucene40StoredFieldsFormat.
*/
type Lucene41StoredFieldsFormat struct {
	*compressing.CompressingStoredFieldsFormat
}

func NewLucene41StoredFieldsFormat() *Lucene41StoredFieldsFormat {
	return &Lucene41StoredFieldsFormat{
		compressing.NewCompressingStoredFieldsFormat("Lucene41StoredFields", "", compressing.COMPRESSION_MODE_FAST, 1<<14),
	}
}
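
// Usage sketch (not part of this package): constructing the format and
// noting that the chunk size passed to the embedded
// CompressingStoredFieldsFormat (1<<14 = 16384 bytes) is the 16KB block
// size the documentation above describes. The import path below assumes
// this module's layout.
//
//	package main
//
//	import (
//		"fmt"
//
//		"github.com/balzaczyy/golucene/core/codec/lucene41"
//	)
//
//	func main() {
//		f := lucene41.NewLucene41StoredFieldsFormat()
//		fmt.Printf("%T\n", f) // *lucene41.Lucene41StoredFieldsFormat
//	}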