github.com/balzaczyy/golucene@v0.0.0-20151210033525-d0be9ee89713/core/codec/lucene41/storedFieldsFormat.go

package lucene41

import (
	"github.com/balzaczyy/golucene/core/codec/compressing"
)

// lucene41/Lucene41StoredFieldsFormat.java

/*
Lucene 4.1 stored fields format.

Principle

This StoredFieldsFormat compresses blocks of 16KB of documents in
order to improve the compression ratio compared to document-level
compression. It uses the LZ4 compression algorithm, which is fast to
compress and very fast to decompress data. Although the compression
method that is used focuses more on speed than on compression ratio,
it should provide interesting compression ratios for redundant inputs
(such as log files, HTML or plain text).

File formats

Stored fields are represented by two files:

1. field_data

A fields data file (extension .fdt). This file stores a compact
representation of documents in compressed blocks of 16KB or more.
When writing a segment, documents are appended to an in-memory []byte
buffer.
When its size reaches 16KB or more, some metadata about the
documents is flushed to disk, immediately followed by a compressed
representation of the buffer using the [LZ4](http://code.google.com/p/lz4/)
[compression format](http://fastcompression.blogspot.ru/2011/05/lz4-explained.html).

Here is a more detailed description of the field data file format:

- FieldData (.fdt) --> <Header>, PackedIntsVersion, <Chunk>^ChunkCount
- Header --> CodecHeader
- PackedIntsVersion --> PackedInts.VERSION_CURRENT as a VInt
- ChunkCount is not known in advance and is the number of chunks
  necessary to store all documents of the segment
- Chunk --> DocBase, ChunkDocs, DocFieldCounts, DocLengths, <CompressedDocs>
- DocBase --> the ID of the first document of the chunk as a VInt
- ChunkDocs --> the number of documents in the chunk as a VInt
- DocFieldCounts --> the number of stored fields of every document
  in the chunk, encoded as follows:
  - if ChunkDocs=1, the unique value is encoded as a VInt
  - else read a VInt (let's call it bitsRequired)
    - if bitsRequired is 0 then all values are equal, and the common
      value is the following VInt
    - else bitsRequired is the number of bits required to store any
      value, and values are stored in a packed array where every value
      is stored on exactly bitsRequired bits
- DocLengths --> the lengths of all documents in the chunk, encoded
  with the same method as DocFieldCounts
- CompressedDocs --> a compressed representation of <Docs> using
  the LZ4 compression format
- Docs --> <Doc>^ChunkDocs
- Doc --> <FieldNumAndType, Value>^DocFieldCount
- FieldNumAndType --> a VLong, whose 3 last bits are Type and other
  bits are FieldNum
- Type -->
  - 0: Value is string
  - 1: Value is BinaryValue
  - 2: Value is int
  - 3: Value is float32
  - 4: Value is int64
  - 5: Value is float64
  - 6, 7: unused
- FieldNum --> an ID of the field
- Value --> string | BinaryValue
  | int | float32 | int64 | float64, depending on Type
- BinaryValue --> ValueLength <Byte>^ValueLength

Notes

- If documents are larger than 16KB then chunks will likely contain
  only one document. However, documents can never spread across several
  chunks (all fields of a single document are in the same chunk).
- When at least one document in a chunk is large enough so that the
  chunk is larger than 32KB, the chunk will actually be compressed in
  several LZ4 blocks of 16KB. This allows StoredFieldsVisitors which
  are only interested in the first fields of a document to not have to
  decompress 10MB of data if the document is 10MB, but only 16KB.
- Given that the original lengths are written in the metadata of the
  chunk, the decompressor can leverage this information to stop decoding
  as soon as enough data has been decompressed.
- In case documents are incompressible, CompressedDocs will be less
  than 0.5% larger than Docs.

2. field_index

A fields index file (extension .fdx).

- FieldsIndex (.fdx) --> <Header>, <ChunkIndex>
- Header --> CodecHeader
- ChunkIndex: See CompressingStoredFieldsIndexWriter

Known limitations

This StoredFieldsFormat does not support individual documents larger
than (2^31 - 2^14) bytes. In case this is a problem, you should use
another format, such as Lucene40StoredFieldsFormat.
*/
type Lucene41StoredFieldsFormat struct {
	*compressing.CompressingStoredFieldsFormat
}

func NewLucene41StoredFieldsFormat() *Lucene41StoredFieldsFormat {
	return &Lucene41StoredFieldsFormat{
		compressing.NewCompressingStoredFieldsFormat(
			"Lucene41StoredFields", "", compressing.COMPRESSION_MODE_FAST, 1<<14),
	}
}
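To make the FieldNumAndType layout from the doc comment concrete, here is a minimal standalone sketch of the bit packing (field number in the high bits, value type in the 3 low bits). The helper names and constants are hypothetical illustrations, not part of this package; in the real format the packed value is additionally written as a VLong.

```go
package main

import "fmt"

// Value types as encoded in the 3 low bits of FieldNumAndType,
// following the table in the format description above.
const (
	typeString  int64 = 0
	typeBinary  int64 = 1
	typeInt     int64 = 2
	typeFloat32 int64 = 3
	typeInt64   int64 = 4
	typeFloat64 int64 = 5
)

// packFieldNumAndType combines a field number and a value type into a
// single integer: the 3 low bits hold the type, the remaining bits the
// field number. The result would then be serialized as a VLong.
func packFieldNumAndType(fieldNum, valueType int64) int64 {
	return fieldNum<<3 | valueType
}

// unpackFieldNumAndType reverses packFieldNumAndType.
func unpackFieldNumAndType(v int64) (fieldNum, valueType int64) {
	return v >> 3, v & 7
}

func main() {
	packed := packFieldNumAndType(42, typeFloat64)
	num, typ := unpackFieldNumAndType(packed)
	fmt.Println(packed, num, typ) // 341 42 5
}
```

Packing the type into the low bits (rather than the high bits) keeps the VLong encoding of FieldNumAndType small for small field numbers, since VLong length grows with the magnitude of the value.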